# Introduction to Data Science

Working with data is a process consisting of the following steps:
1. Acquiring data
2. Cleaning data
3. Exploring data
4. Building a model
5. Presenting the data

Plenty of cases have more or fewer steps; this tutorial walks through the baseline case, a general example.

[beautiful tutorial](https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce)

[just a lovely walk through by the same person, really inviting](https://nbviewer.jupyter.org/github/Tanu-N-Prabhu/Python/blob/master/Top_Python_Libraries_Used_In_Data%C2%A0Science.ipynb)



## Getting data

Data can be acquired through web scraping (see the libraries beautifulsoup, requests and pandas) or downloaded online (e.g. from Kaggle). Not very interesting for noteable's purposes.
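A minimal sketch of the download route, assuming the data arrives as a CSV. The inline text stands in for a real downloaded file, and the column names are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for a downloaded CSV file (hypothetical columns, illustration only).
raw_csv = io.StringIO(
    "Make,Year,MSRP\n"
    "Ford,2015,21000\n"
    "Audi,2017,43000\n"
)

# read_csv accepts a file path, a URL, or any file-like object.
df = pd.read_csv(raw_csv)
print(df.head())
```

With a real file, the same call would take the path instead of the StringIO object.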

## Cleaning data

Library: pandas

A BIG part of data science.

First get the data from raw to technically correct.

Then get it from technically correct to consistent.

Steps to technically correct data include:
- removing unused or irrelevant columns and rows, and columns with zero variance
- renaming variables for consistency and meaning
- removing duplicates (if this doesn't cause loss of information)
- encoding missing values appropriately (if required)
- fixing dates (a common bugbear)
- fixing strings (another common bugbear)
- recoding vectors (e.g. use 0 and 1 instead of yes/no)
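The steps above could look roughly like this in pandas. This is a sketch on made-up toy data, not a recipe: the column names and the choice of map() for recoding are assumptions for illustration.

```python
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({
    "Name ": [" ann", "bob", "bob", "cat "],
    "joined": ["2020-01-05", "2020-02-10", "2020-02-10", "2020-03-15"],
    "member": ["yes", "no", "no", "yes"],
    "source": ["web", "web", "web", "web"],  # same value everywhere: zero variance
})

# Drop columns that only ever hold one value.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)

# Rename for consistency (the original header has a trailing space).
df = df.rename(columns={"Name ": "name"})

# Remove exact duplicate rows.
df = df.drop_duplicates()

# Fix dates: parse the strings into datetime values.
df["joined"] = pd.to_datetime(df["joined"], format="%Y-%m-%d")

# Fix strings: trim whitespace and normalise capitalisation.
df["name"] = df["name"].str.strip().str.title()

# Recode yes/no as 1/0.
df["member"] = df["member"].map({"yes": 1, "no": 0})

print(df)
```

Note that duplicates are dropped before the strings are tidied here, so only rows that were identical in the raw data count as duplicates; doing it the other way round is an equally reasonable choice.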

Then, to get to consistent data:
- addressing variable constraints (can age = -35?)
- dealing with outliers
- handling missing values: choosing strategies that get the most from data with missing values
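A small sketch of the constraint and missing-value steps, on made-up ages. Filling with the median is just one possible strategy, picked here for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({"age": [34.0, -35.0, 51.0, np.nan, 29.0]})

# Constraint: an age cannot be negative, so treat violations as missing.
df.loc[df["age"] < 0, "age"] = np.nan

# One possible strategy for missing values: fill with the column median.
median_age = df["age"].median()  # NaN values are skipped by default
df["age"] = df["age"].fillna(median_age)

print(df)
```

Dropping the affected rows with dropna() is the simpler alternative when there is plenty of data.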

Useful pandas functions:
- read_csv() - read a CSV file into a DataFrame
- head(n) - view the top n rows of data
- tail() - view the bottom rows
- drop() - eliminate columns
- rename() - rename variables, dict style
- replace() - dict style too, recode vectors

### on outliers

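Using the IQR score technique: rows with a value more than 1.5 × IQR below Q1 or above Q3 are treated as outliers and dropped.

<code>import seaborn as sns
sns.boxplot(x=df['colname'])  # visual check of one column
num = df.select_dtypes(include='number')  # quantiles only make sense for numeric columns
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]  # keep rows with no outlying value
df.shape</code>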
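To see the IQR rule on actual numbers, here is a single-column sketch on made-up prices. It is a simplified variant of the snippet above, which filters on every numeric column at once.

```python
import pandas as pd

# Toy data, made up for illustration; 100 is the obvious outlier.
df = pd.DataFrame({"price": [10, 12, 11, 13, 12, 100]})

q1 = df["price"].quantile(0.25)  # 11.25
q3 = df["price"].quantile(0.75)  # 12.75
iqr = q3 - q1                    # 1.5

lower = q1 - 1.5 * iqr           # 9.0
upper = q3 + 1.5 * iqr           # 15.0

# Keep only rows inside the fences (between() is inclusive by default).
filtered = df[df["price"].between(lower, upper)]
print(filtered["price"].tolist())  # [10, 12, 11, 13, 12]
```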

## Exploring data (EDA)

In reality you do this alongside cleaning, going back and forth between the two.

NB: pandas functions do not normally work in place, so assign the result back. Write <code>df = df.drop_duplicates()</code>, not just <code>df.drop_duplicates()</code>.
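A quick demonstration of the point, on a made-up three-row frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 2]})

df.drop_duplicates()        # returns a new frame; df itself is unchanged
rows_before = len(df)

df = df.drop_duplicates()   # assign the result back to keep it
rows_after = len(df)

print(rows_before, rows_after)  # 3 2
```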

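Use pandas functions and attributes:
- read_csv() - read a CSV file into a DataFrame
- head() - look at the first rows of the data
- dtypes - attribute, the type of each column
- shape - attribute, (rows, cols)
- duplicated() - flag duplicate rows
- drop_duplicates() - remove duplicate rows
- count() - number of non-null values per column
- isnull() - flag null values
- dropna() - drop rows with missing values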
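A short sketch of these calls on a made-up frame, to show what each one returns:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({
    "make": ["Ford", "Audi", "Audi", "BMW"],
    "price": [21000.0, 43000.0, 43000.0, np.nan],
})

print(df.shape)    # (rows, columns)
print(df.dtypes)   # type of each column

n_duplicates = int(df.duplicated().sum())  # rows that repeat an earlier row
non_null = df.count()                      # non-null values per column
missing = df.isnull().sum()                # null values per column

# Both return new frames, so the calls can be chained.
cleaned = df.drop_duplicates().dropna()
print(cleaned)
```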

See also:
- seaborn - visualisation, imported as sns, e.g. sns.set(color_codes=True)
- matplotlib - visualisation

### Histograms

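Example from [here](https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce). Strictly this is a bar chart of counts per category, which does the job of a histogram for a categorical column:

<code>import matplotlib.pyplot as plt
df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))  # 40 most common makes
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make');</code>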

### Heatmap

Seaborn is broken for some reason, but a seaborn heatmap would be nice. Matplotlib is an alternative, though it is a bit cliched at this point.

Try bokeh.

See the knn notebook.
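For the matplotlib route, a minimal sketch of a correlation heatmap. That the heatmap is meant to show correlations is an assumption, and the data and column names are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({
    "year": [2012, 2014, 2016, 2018],
    "hp": [150, 180, 210, 260],
    "mpg": [34, 31, 28, 24],
})

corr = df.corr()  # pairwise Pearson correlations between columns

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)

# Label both axes with the column names.
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)

fig.colorbar(im, ax=ax)
ax.set_title("Correlation heatmap")
```

In a notebook the figure displays on its own; in a script, fig.savefig() writes it to a file.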

## Deciding on a model

This is when things actually kick off.

There's a lovely [cheat sheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) to help you decide.

## Build, fit, train, test

Library: sklearn (scikit-learn)
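A minimal sketch of the build, fit and test cycle in sklearn. Linear regression and the synthetic data are assumptions chosen to keep the example small; the same split, fit and predict pattern applies to other estimators.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data, made up for illustration: y = 3x + 2 with no noise.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0

# Hold back 25% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()      # build
model.fit(X_train, y_train)     # fit/train on the training split only
pred = model.predict(X_test)    # test: predict on rows the model has not seen

print(model.coef_, model.intercept_)
```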

## Performance

I stole [this snippet](https://medium.com/analytics-vidhya/building-a-machine-learning-model-to-predict-the-price-of-the-car-bc51783ba2f3). Ideas on functions and modules:

<code>from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score</code>

Consider different evaluators (e.g. mean absolute error, mean squared error, R²)!
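A small sketch comparing three of the imported metrics on made-up values, so the numbers can be checked by hand:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions, for illustration only.
y_test = [3.0, 5.0, 7.0, 9.0]
pred = [2.0, 5.0, 8.0, 9.0]

mae = mean_absolute_error(y_test, pred)  # mean of |error|: (1 + 0 + 1 + 0) / 4 = 0.5
mse = mean_squared_error(y_test, pred)   # mean of error^2: (1 + 0 + 1 + 0) / 4 = 0.5
r2 = r2_score(y_test, pred)              # 1 - SS_res / SS_tot = 1 - 2 / 20 = 0.9

print(mae, mse, r2)
```

MAE and MSE happen to agree here because every error is 0 or 1; with larger errors MSE penalises them more heavily.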

## Visualise Results

AKA the interesting bit.

This is a lovely way of plotting a regression with a line of best fit, stolen from [here](https://medium.com/analytics-vidhya/building-a-machine-learning-model-to-predict-the-price-of-the-car-bc51783ba2f3):

<code>import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(6, 6))
plt.title('Visualizing the Regression using Lasso Regression algorithm')
sns.regplot(x=pred, y=y_test, color='teal')  # pred = model predictions, y_test = true values
plt.xlabel("New Predicted Price (MSRP)")
plt.ylabel("Old Price (MSRP)")
plt.show()</code>

NB: here, MSRP is the target variable.