# [Overview](https://github.com/summerela/python_data_analysis/blob/master/pandas_basics/notebooks/Data%20Analysis%20Overview.ipynb)

## First, Locate Data: Kaggle, scikit-learn Built-in Datasets, UCI ML Repository
    

[Sklearn](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets)

[Getting Started](https://scikit-learn.org/stable/getting_started.html)

[Markdown cheatsheet](https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed)


In [6]:
# scikit-learn is an open-source ML library for supervised and unsupervised learning, with tools for
# model fitting, data preprocessing, model selection, and model evaluation (train_test_split, cross_validate).
# You fit a model to data using its fit method.
# ex: from sklearn.ensemble import RandomForestClassifier
#    clf = RandomForestClassifier(random_state=0)
#    (set X and y) // both are array-like // X = samples, y = target values
#    clf.fit(X, y)
# USE the model: clf.predict(X_new)  (pass new samples only, not the targets)
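
The fit-then-predict pattern described in the comments above can be sketched as follows. The sample values are made up for illustration, and `X_new` is a hypothetical name for unseen samples.

```python
from sklearn.ensemble import RandomForestClassifier

# Two toy samples with three features each (illustrative values).
X = [[1, 2, 3], [11, 12, 13]]
y = [0, 1]  # one class label per sample

clf = RandomForestClassifier(random_state=0)
clf.fit(X, y)  # learn from the training data

# predict takes samples only and returns one label per row.
X_new = [[4, 5, 6], [14, 15, 16]]
predictions = clf.predict(X_new)
print(predictions)
```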

### Types of Learning
- Supervised
    - classification 
    - regression (desired output is one or more continuous variables)
- Unsupervised
    - clustering (groups of similar examples within the data)
    - density estimation (distribution of data)

### [Scikit-learn has standard datasets](https://scikit-learn.org/stable/tutorial/basic/tutorial.html#machine-learning-the-problem-setting)
- iris and digits for classification
- Boston house prices for regression (`load_boston` was removed in scikit-learn 1.2)
> $ python
> from sklearn import datasets
> iris = datasets.load_iris()
> digits = datasets.load_digits()
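
As a rough sketch of what the loaders return, the snippet below prints the shape of each dataset. It assumes only that scikit-learn is installed.

```python
from sklearn import datasets

iris = datasets.load_iris()
digits = datasets.load_digits()

# Each loader returns a dictionary-like object:
# .data holds the samples, .target holds the labels.
print(iris.data.shape)    # (n_samples, n_features)
print(digits.data.shape)
print(iris.target_names)
```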

### Working with dataset
- Load a dataset
- fit an **estimator** on the data so you can **predict** unseen samples
- an example of an estimator is `svm.SVC` (support vector classification), which is used in two steps
    - first the estimator is fitted to the data (it learns from the training set passed to its `fit` method)
    - then the trained estimator can predict unseen samples (`predict` method)
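
The load, fit, and predict steps above might look like this with the digits dataset. The `gamma` and `C` values are illustrative parameter choices, not tuned settings, and holding out only the last sample is a simplification.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()

# Model parameters are passed to the constructor (values chosen for illustration).
clf = svm.SVC(gamma=0.001, C=100.0)

# Fit on every sample except the last one, which stands in for unseen data.
clf.fit(digits.data[:-1], digits.target[:-1])

# Predict the label of the held-out sample and compare it with the true label.
predicted = clf.predict(digits.data[-1:])
print(predicted, digits.target[-1])
```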



### [Kaggle](https://www.kaggle.com/c/titanic)
- The Titanic competition provides 2 datasets: train.csv and test.csv
    - train.csv shows whether each passenger survived: the "ground truth"
    - test.csv doesn't include this ground truth info


### [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
- 488 data sets, searchable
- ex: adult.data, adult.names, adult.test
- https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
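
A file like adult.data can be read with pandas. The sketch below is one possible approach under a few assumptions: `load_adult` is a hypothetical helper, the `income` column name is made up (the file has no header row), `?` is assumed to mark missing values, and the two rows are illustrative stand-ins so the example runs without a download.

```python
import io
import pandas as pd

# Attribute names follow adult.names; "income" is an assumed name for the target.
ADULT_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]

def load_adult(source):
    """Read adult.data from a path, URL, or file-like object (hypothetical helper)."""
    return pd.read_csv(
        source,
        header=None,
        names=ADULT_COLUMNS,
        skipinitialspace=True,  # drop the space that follows each comma
        na_values="?",          # assumed missing-value marker
    )

# Illustrative rows in the same comma-separated layout.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, ?, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, "
    "Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)
adult = load_adult(sample)
print(adult.shape)
print(adult["workclass"].isna().sum())
```

Passing the URL listed above to `load_adult` instead of `sample` should read the full file, assuming network access.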

## Second, [Data Munging: Prep the data](https://github.com/summerela/python_data_analysis/blob/master/pandas_basics/notebooks/Data%20Analysis%20Overview.ipynb)

An often-cited estimate is that 80% of a data analyst's or data scientist's time is spent making data usable:
    
- Fixing formatting errors and misspelled words
- Restructuring and removing redundancies
- Flattening and grouping data
- Fixing errors
- Locating missing entries
- Normalization
- Binning
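
Several of the steps above can be sketched with pandas. The records, column names, and bin edges below are made up for illustration; real data will need its own rules.

```python
import pandas as pd

# Made-up records with typical problems: stray whitespace, inconsistent case,
# a duplicate row, and a missing value.
raw = pd.DataFrame({
    "city": [" Seattle", "seattle ", "Portland", "Portland", "Boise"],
    "temp_f": [61.0, 61.0, 70.0, 70.0, None],
})

clean = raw.copy()
clean["city"] = clean["city"].str.strip().str.title()   # fix formatting
clean = clean.drop_duplicates().reset_index(drop=True)  # remove redundancies
missing = clean["temp_f"].isna()                        # locate missing entries

# Min-max normalization to the range [0, 1]; missing values stay missing.
t = clean["temp_f"]
clean["temp_norm"] = (t - t.min()) / (t.max() - t.min())

# Binning into two labelled intervals; the edges are arbitrary.
clean["temp_bin"] = pd.cut(t, bins=[0, 65, 100], labels=["cool", "warm"])
print(clean)
```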

## Third, Exploratory Data Analysis (EDA)
- View summary statistics such as mean, quantiles, variance, standard deviation
- Determine the type of distribution (normal, bimodal, skewed)
- Examine data types (numerical, continuous, categorical)
- Deal with missing values
- Determine potential relationships between your variables
- Identify and isolate outliers
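
A first EDA pass over a small table might look like the sketch below. The data is made up, with one planted outlier, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import pandas as pd

# Made-up measurements; 95 is planted as an obvious outlier.
df = pd.DataFrame({
    "height": [150, 160, 165, 170, 172, 175, 180, 95],
    "weight": [50, 58, 63, 68, 70, 74, 80, None],
    "group": ["a", "b", "a", "b", "a", "b", "a", "b"],
})

print(df.describe())        # count, mean, std, min, quartiles, max
print(df.dtypes)            # numerical vs. text columns
print(df.isna().sum())      # missing values per column
print(df["height"].skew())  # the sign hints at the direction of skew
print(df[["height", "weight"]].corr())  # pairwise linear relationships

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = df["height"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["height"] < q1 - 1.5 * iqr) | (df["height"] > q3 + 1.5 * iqr)]
print(outliers)
```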

## Select a Statistical Model
- To model a continuous outcome as a function of other variables, **linear regression**
- To model a binary/discrete outcome (cat/dog, yes/no), **logistic regression**
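
The two choices above map onto two scikit-learn estimators. This is a minimal sketch on made-up data with a single feature; it uses default settings and skips the train/test split a real analysis would need.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up data: one feature, six samples.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Continuous target: linear regression.
y_continuous = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])
lin = LinearRegression().fit(X, y_continuous)
print(lin.coef_, lin.intercept_)

# Binary target (0 = no, 1 = yes): logistic regression.
y_binary = np.array([0, 0, 0, 1, 1, 1])
log = LogisticRegression().fit(X, y_binary)
print(log.predict([[1.5], [5.5]]))
print(log.predict_proba([[5.5]]))  # class probabilities for one sample
```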

## Check model assumptions
Each statistical model has mathematical and logical requirements that must hold before we can apply it. For example, linear regression assumes that the relationship between the independent and dependent variables is linear. We can check this by:

- Visualizing our distributions
- Checking for independence of variables
- Plotting relationships, such as in a scatterplot

Each model has its own assumptions, and you will need to learn the methods for checking those assumptions before continuing with your model of choice.
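
As a numeric complement to plotting, the linearity assumption can be probed with the Pearson correlation and the residuals of a straight-line fit. This is a rough sketch on synthetic, noise-free data; `pearson_r` and `linear_fit_residuals` are hypothetical helper names, and real data would also call for a visual check.

```python
import numpy as np

x = np.linspace(-3, 3, 50)
y_linear = 2 * x + 1   # straight-line relationship
y_curved = x ** 2      # clearly non-linear relationship

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

def linear_fit_residuals(a, b):
    """Residuals left over after fitting b = slope * a + intercept."""
    slope, intercept = np.polyfit(a, b, deg=1)
    return b - (slope * a + intercept)

print(pearson_r(x, y_linear))  # close to 1 for a linear relationship
print(pearson_r(x, y_curved))  # close to 0: a straight line misses the pattern
print(np.abs(linear_fit_residuals(x, y_curved)).max())  # large residuals
```

Residuals that are large or show a clear pattern when plotted against `x` suggest the linear model does not fit the relationship.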

## Build the model

## Evaluate the model using its particular metrics
- Variance
- Confidence intervals
- Mean Squared Error
- Ordinary Least Squares
- AUC/ROC
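
Two of the metrics above, mean squared error and the area under the ROC curve, are available in `sklearn.metrics`. The values below are small made-up examples chosen so the results can be checked by hand.

```python
from sklearn.metrics import mean_squared_error, roc_auc_score

# Regression: mean squared error between true and predicted values.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mse = mean_squared_error(y_true, y_pred)
print(mse)  # 0.375

# Classification: area under the ROC curve from true labels and predicted scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_score(labels, scores)
print(auc)  # 0.75
```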

## Finally, Interpret and Conclude
And finally, what do we make of our analysis? Explain the relationships found and the assumptions made, and support your interpretation of the results with evidence from your analysis. If there are issues, assumptions, or areas that need further exploration, state those here as well.

# Implementation
## [Intro](https://github.com/summerela/python_data_analysis/blob/master/pandas_basics/notebooks/Intro%20to%20Pandas.ipynb)

In [None]:
# import package
import pa