# Model Making

### Splitting:
``` Python
from sklearn.model_selection import train_test_split 
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)  # Splits data into training/validation data
```

### Model types:
```Python
from sklearn.tree import DecisionTreeRegressor      # Decision tree
from sklearn.ensemble import RandomForestRegressor  # Random forest

model = RandomForestRegressor(random_state=0)       # or DecisionTreeRegressor(random_state=0)
```
#### Standard procedure:
```Python
model.fit(train_X,train_y)
prediction = model.predict(val_X)
```

### Model Scoring:
```Python
from sklearn.metrics import mean_absolute_error
mean_absolute_error(val_y, prediction)   # compares the true values with prediction = model.predict(val_X)
```
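
Putting the splitting, fitting and scoring steps together, a minimal end-to-end sketch could look like this. The data is synthetic (made up purely for illustration), and `n_estimators=50` is an arbitrary choice, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for this sketch): y depends on two features plus noise
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
y = 3 * X["a"] - 2 * X["b"] + rng.normal(scale=0.1, size=200)

# By default 25% of the rows are held back for validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(train_X, train_y)
prediction = model.predict(val_X)

# Lower is better: average absolute gap between prediction and truth
mae = mean_absolute_error(val_y, prediction)
print(f"Validation MAE: {mae:.3f}")
```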

### Imputation:
```Python
from sklearn.impute import SimpleImputer
# Imputation
my_imputer = SimpleImputer()                 # choose the method with strategy=: "mean" (default), "median", "most_frequent" or "constant"
                                             # if "constant", choose the value with fill_value=
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns    # These are the new sets you use for the model

```
An extension to imputing is to add, for each column that had missing values, an extra True/False column marking which entries were missing, and to use these columns together with the imputed data.
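
A rough sketch of that extension, assuming toy data and a `_was_missing` suffix that is just a naming choice for this example. The flags are stored as 0/1 rather than True/False so that every column stays numeric for the imputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up for this sketch)
X_train = pd.DataFrame({"rooms": [3, np.nan, 4], "area": [70.0, 55.0, np.nan]})
X_valid = pd.DataFrame({"rooms": [np.nan, 2], "area": [80.0, 60.0]})

# Columns with at least one missing value in the training data
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
for col in cols_with_missing:
    # 1 where the original value was missing, 0 otherwise
    X_train_plus[col + "_was_missing"] = X_train_plus[col].isnull().astype(int)
    X_valid_plus[col + "_was_missing"] = X_valid_plus[col].isnull().astype(int)

my_imputer = SimpleImputer()  # default strategy is "mean"
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train_plus),
                               columns=X_train_plus.columns)
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid_plus),
                               columns=X_valid_plus.columns)
print(imputed_X_train)
```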

# Seaborn

```Python
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()  # Sets the graph theme to the seaborn style!
sns.set_style("dark") # 5 themes: (1)"darkgrid", (2)"whitegrid", (3)"dark", (4)"white", and (5)"ticks"
```

https://www.kaggle.com/alexisbcook/choosing-plot-types-and-custom-styles

### Lineplot:
``` Python
sns.lineplot(data=dat_data['####'], label="####")
```

### Barplot:
``` Python
sns.barplot(x=dat_data.index, y=dat_data['###'])                   # remember the index part for the x axis!!
```

### Heatmap: 
`annot=True` writes each cell's value on the heatmap
``` Python
sns.heatmap(data=dat_data, annot=True)                             # heatmap takes the whole 2D table via data=, not x=
```
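
A small runnable sketch of a heatmap, assuming a made-up table where rows are months and columns are products. The `Agg` backend is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy 2D table (made up): one row per month, one column per product
dat_data = pd.DataFrame(
    {"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]},
    index=["Jan", "Feb", "Mar"],
)

plt.figure(figsize=(4, 3))
ax = sns.heatmap(data=dat_data, annot=True)  # annot=True writes the number in each cell
ax.set_xlabel("Product")
ax.set_ylabel("Month")
```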

### Scatterplot:
`hue` colours the points according to the values in another column
``` Python
sns.scatterplot(x=dat_data['###'], y=dat_data['###'],hue=dat_data["ABC"])
```
### Regressionplot:
Just a scatter plot with a line of best fit
``` Python
sns.regplot(x=dat_data['###'], y=dat_data['###'])

sns.lmplot(x="ABC", y="DEF", hue="HIJ", data=dat_data)             # For separate lines of best fit, one per value of "HIJ"

sns.swarmplot(x=dat_data['ABC'],
              y=dat_data['DEF'])              # Categorical scatter plot: one vertical strip of points per category in x
```
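
As a quick illustration of the difference, `regplot` fits one line to everything while `lmplot` with `hue` fits one line per group. The data below is made up, and the column names are the same placeholders used above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import pandas as pd
import seaborn as sns

# Toy data (made up): two groups with different slopes
dat_data = pd.DataFrame({
    "ABC": [1, 2, 3, 4, 1, 2, 3, 4],
    "DEF": [2.1, 3.9, 6.2, 8.1, 1.0, 1.4, 2.1, 2.4],
    "HIJ": ["x", "x", "x", "x", "y", "y", "y", "y"],
})

# lmplot takes column names plus data=, and returns a FacetGrid
grid = sns.lmplot(x="ABC", y="DEF", hue="HIJ", data=dat_data)
```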

### Histograms
`a` selects the column to plot, and setting `kde=True` overlays a kernel density estimate curve on the histogram
``` Python
sns.distplot(a=dat_data["ABC"], kde=False, bins=20)                # bins= sets the number of bins (20 is just an example)
                                                                   # distplot is deprecated since seaborn 0.11; sns.histplot is the replacement
```

### Density plots - KDE for Kernel Density Estimate
`shade` shades in the area underneath the density graph (newer seaborn versions call this `fill`)
``` Python 
sns.kdeplot(data=dat_data['ABC'], shade=True)
```
### 2D KDE plots
``` Python
sns.jointplot(x=dat_data["ABC"], y=dat_data["DEF"], kind="kde")
```

------
-----

# Pandas

```Python
import pandas as pd
```

### Series:

```Python
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

# Output:
# 2015 Sales    30
# 2016 Sales    35
# 2017 Sales    40
# Name: Product A, dtype: int64
```

```Python
.shape                              # size of the data as (rows, columns)
.dtype / .dtypes                    # No brackets! dtype of one column (Series) / dtypes of every column (DataFrame)
.astype()                           # converts the datatype, e.g. to str, "float64", "int64"
.describe()                         # gives a summary of the data
.unique()                           # gives all the unique values
.value_counts()                     # how often each value occurs, most frequent first
.map() / .apply()                   # applies a function to every item in a column (map) / every row or column of a DataFrame (apply)
.idxmax()                           # returns the index label of the maximum value
.replace("a", "b")                  # changes entries from one value to another
.fillna()                           # replaces all empty entries with a given value; .bfill() fills with the next valid value in the column
.drop(labels, axis=)                # drops rows (axis=0) or columns (axis=1)
pd.read_csv("file.csv")             # opens the csv as a DataFrame
df.to_csv("file.csv")               # saves the DataFrame df as a csv

``` 
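
A few of these in action on a small made-up Series (the values and labels are only for illustration):

```python
import pandas as pd

sales = pd.Series([30, 35, 40, 35],
                  index=["2015", "2016", "2017", "2018"],
                  name="Product A")

print(sales.shape)                             # (4,)
counts = sales.value_counts()                  # most frequent value first
best_year = sales.idxmax()                     # index label of the largest value
in_thousands = sales.map(lambda v: v * 1000)   # apply a function to every item
as_float = sales.astype("float64")             # convert the datatype
cleaned = sales.replace(35, 36)                # swap one value for another
```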

### Indexing:

```Python
data[data.country.isin(["ABC", "DEF"])]           #  isin lets you select rows whose value is in a list

data[data.country.isnull()]                       #  isnull lets you select rows with empty entries (also notnull())

cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]   # indexing columns containing empty entries
```
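
The same three patterns on a tiny made-up DataFrame, as a quick check of what each one returns:

```python
import numpy as np
import pandas as pd

# Toy data (made up): one missing country and one missing price
data = pd.DataFrame({
    "country": ["ABC", "DEF", "XYZ", None],
    "price": [10.0, np.nan, 30.0, 40.0],
})

in_either = data[data.country.isin(["ABC", "DEF"])]   # rows where country is ABC or DEF
no_price = data[data.price.isnull()]                  # rows with an empty price
has_price = data[data.price.notnull()]                # rows with a price

# Names of the columns that contain at least one empty entry
cols_with_missing = [col for col in data.columns if data[col].isnull().any()]
```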

### grouping/sorting: https://www.kaggle.com/residentmario/grouping-and-sorting

Grouping can sometimes return a multi-index; run `.reset_index()` to turn it back into ordinary columns.

``` Python
.groupby(CATEGORY)                      # groups the rows that share the same value in a column
# example: reviews.groupby('points').price.min()   gives the minimum price for each point value
# example: reviews.groupby('points').size()        similar to value_counts() but ordered by group, not by count

.agg([len, min, max])                               # runs a list of functions on each group at once


.sort_values(by='len', ascending=False) # sorts by a specific column; by= also accepts a list of columns
```
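
A runnable sketch of the grouping and sorting calls, assuming a made-up `reviews` table. It passes the aggregations as strings (`"count"`, `"min"`, `"max"`) instead of the built-in functions, which is a choice for this example.

```python
import pandas as pd

# Toy data (made up)
reviews = pd.DataFrame({
    "points": [90, 90, 85, 85, 85],
    "price": [20.0, 35.0, 10.0, 15.0, 12.0],
})

min_price = reviews.groupby("points").price.min()   # cheapest price per point value
n_reviews = reviews.groupby("points").size()        # number of rows per point value

# Several aggregations at once, then sort by one of the resulting columns
summary = reviews.groupby("points").price.agg(["count", "min", "max"])
sorted_summary = summary.sort_values(by="count", ascending=False)

# Turn the group labels back into an ordinary column
flat = sorted_summary.reset_index()
```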

### renaming

``` Python
.rename(columns/index={"a":"b"})              # changes the name of columns/index
.rename_axis("a",axis="rows"/"columns")       # lets you change the name of the axis
```
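
For example, with a made-up one-column DataFrame (all the new names here are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

renamed = df.rename(columns={"a": "b"})        # rename a column
renamed = renamed.rename(index={0: "first"})   # rename an index entry

# Give the row axis and the column axis their own names
named = renamed.rename_axis("items", axis="rows").rename_axis("fields", axis="columns")
```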

### combining

``` Python  
pd.concat([a,b])                            # Lets you smush DataFrames together, useful when they have the same columns

.join(data, lsuffix='_1', rsuffix='_2')     # Lets you combine DataFrames which have an index in common.
                                            # lsuffix and rsuffix are appended to column names that appear in both, to tell them apart
```
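
A short sketch of both, using made-up frames: `concat` stacks rows, `join` lines rows up by index.

```python
import pandas as pd

# Same columns -> stack them on top of each other
a = pd.DataFrame({"name": ["x", "y"], "score": [1, 2]})
b = pd.DataFrame({"name": ["z"], "score": [3]})
stacked = pd.concat([a, b], ignore_index=True)   # ignore_index renumbers the rows 0..n-1

# Same index, clashing column name -> join side by side with suffixes
left = pd.DataFrame({"score": [1, 2]}, index=["x", "y"])
right = pd.DataFrame({"score": [10, 20]}, index=["x", "y"])
joined = left.join(right, lsuffix="_1", rsuffix="_2")
```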

----
----

# Scaling and Normalization


``` Python
# for min_max scaling (SCALING)
from mlxtend.preprocessing import minmax_scaling

# for Box-Cox Transformation (NORMALIZATION)
from scipy import stats
```        

 - **scaling** -  changing the range of your data
 - **normalization** - changing the shape of the distribution of your data to a bell curve

### Scaling:

Use when the method is based on measuring how far apart data points are:
 -  support vector machines (SVM)
 -  k-nearest neighbors (KNN)
 ```Python
scaled_data = minmax_scaling(original_data, columns=["ABC"])      # scales between 0-1
```
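
If `mlxtend` is not available, the standard min-max formula `(x - min) / (max - min)` is easy to write by hand. This is a plain-pandas sketch on made-up numbers, not the `mlxtend` implementation itself.

```python
import pandas as pd

# Toy data (made up)
original_data = pd.DataFrame({"ABC": [2.0, 4.0, 10.0]})

col = original_data["ABC"]
# Smallest value maps to 0, largest to 1, everything else in between
scaled = (col - col.min()) / (col.max() - col.min())
print(scaled.tolist())  # [0.0, 0.25, 1.0]
```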

### Normalization

When using methods that assume your data is normally distributed:
-  linear discriminant analysis (LDA)
-  Gaussian naive Bayes

Anything that says Gaussian most likely assumes normally distributed data.

```Python
normalized_data, fitted_lambda = stats.boxcox(original_data)   # returns (transformed data, fitted lambda); data must be 1-D and strictly positive
```
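
To see the effect, here is a sketch on synthetic right-skewed data (drawn from an exponential distribution, which is an assumption made only for this example). The skewness should move much closer to 0 after the transform.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed data
rng = np.random.default_rng(0)
original_data = rng.exponential(size=1000)

# With no lmbda given, boxcox also returns the lambda it fitted
normalized_data, fitted_lambda = stats.boxcox(original_data)

skew_before = stats.skew(original_data)
skew_after = stats.skew(normalized_data)
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```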

-----
-----

# Data Cleaning

```Python
import numpy as np

# how many total missing values do we have?
total_cells = np.prod(dat_data.shape)             # number of all cells
total_missing = dat_data.isnull().sum().sum()     # first sum counts per column, second sum adds the columns up
# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
```

### All null functions:

```Python
.fillna()                           # Replaces all empty entries with a given value
.bfill(axis=0)                      # fills each empty entry with the next valid value in its column (older pandas: fillna(method="bfill"))
.dropna()                           # Removes any row/column with empty entries, specify with axis=
.isnull()                           # Lets you index empty entries (same as .isna())
.notnull()                          # Lets you index non-empty entries

```
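
The null functions side by side on a tiny made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Toy data (made up): 3 of the 6 cells are empty
dat_data = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

missing_per_column = dat_data.isnull().sum()
total_missing = missing_per_column.sum()
percent_missing = total_missing / dat_data.size * 100   # .size is rows * columns

dropped_rows = dat_data.dropna()          # keeps only rows with no empty entries
dropped_cols = dat_data.dropna(axis=1)    # keeps only columns with no empty entries
filled = dat_data.bfill().fillna(0)       # next valid value first, then 0 for anything still empty
```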

## Date conversion

#### Note: this only works if all the dates are in the same format
```Python
data["Dates"] = pd.to_datetime(data["Dates"], format="%m/%d/%y") 
```
3/2/07 $\rightarrow$ 2007-03-02

We can extract information:
```Python
data["Dates"].dt.day                 # returns the day of the month for each date
```

Checking if dates are all the same format:
```Python
date_lengths = data.Dates.str.len()      # length of each raw date string (run this before pd.to_datetime)
date_lengths.value_counts()

# returns something like:
# 10    23409
# 24        3
# Name: Date, dtype: int64
```
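
A small sketch of the whole date workflow on made-up strings. The parsed dates go into a new `Parsed` column (a name chosen for this example) so the raw strings stay available.

```python
import pandas as pd

# Toy data (made up): month/day/two-digit-year strings
data = pd.DataFrame({"Dates": ["3/2/07", "12/25/07", "1/17/08"]})

# Check the raw string lengths first; very different lengths hint at mixed formats
date_lengths = data["Dates"].str.len()

# Convert, then pull out parts of the date with .dt
data["Parsed"] = pd.to_datetime(data["Dates"], format="%m/%d/%y")
day_of_month = data["Parsed"].dt.day
print(data)
```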


Could be useful:
```Python
pd.read_csv("data.csv", parse_dates=["Dates"])   # parses the listed columns as dates while loading
```

## Character encoding:

utf-8 is the standard string encoding, and the default when saving with pandas!
```Python
import chardet                            # character encoding detection module
text.decode("big5-tw")                    # decodes bytes (here Big5-encoded) into a str that python can print
text.encode("utf-8", errors="replace")    # encodes a str to utf-8 bytes; anything that can't be encoded is replaced
```

If an error occurs when opening a file as utf-8, then the file is encoded in something else:
```Python

with open("data.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))           # guesses the encoding from the first 10000 bytes;
                                                           # increase this value if it returns the wrong encoding

print(result)                                              # dict with the guessed 'encoding' and its 'confidence'
pd.read_csv("data.csv", encoding=result['encoding'])
```
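
Encoding and decoding are easier to keep straight with a tiny round trip. This sketch only uses built-in `str` and `bytes` methods, and the sample text is made up.

```python
text = "This is the euro symbol: €"

as_utf8 = text.encode("utf-8")          # str -> bytes
round_trip = as_utf8.decode("utf-8")    # bytes -> str, same text as before

# ASCII has no euro sign, so errors="replace" swaps it for "?"
lossy = text.encode("ascii", errors="replace")
print(lossy)
```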

## Inconsistent Data Entry


For example, South Korea is written as "southkorea", "south korea" and "South Korea". We can clean this up:

```Python
import fuzzywuzzy                  # used to identify which strings are closest to each other
from fuzzywuzzy import process
import chardet                     # for character encoding
```

```Python
.str.strip()                           # removes spaces at the start and end of the string
.str.lower() / .str.upper()            # converts string to lower/upper case
```

```Python
matches = fuzzywuzzy.process.extract("south korea", countries, limit=5, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
```

This returns a list of `(string, score)` pairs for the strings closest to "south korea", something like:
```Python
[('south korea', 100),
 ('southkorea', 48),
 ('saudi arabia', 43),
 ('norway', 35),
 ('austria', 33),
 ('ireland', 33)]
```

Replacing the close matches, knowing that the lowest score we want to keep is 48:
```Python
def replace_matches_in_column(df, column, string_to_match, min_ratio = 47):
    # get a list of unique strings
    strings = df[column].unique()
    
    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, 
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input matches 
    df.loc[rows_with_matches, column] = string_to_match
    
```
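
If `fuzzywuzzy` is not installed, the same idea can be tried with the standard library. The sketch below swaps in `difflib.SequenceMatcher` as the scorer, so the scores will not match the `token_sort_ratio` values above, and `similarity` and `replace_close_matches` are hypothetical helper names made up for this example.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    # 0-100 score from difflib (a stand-in for fuzzywuzzy's scorers)
    return round(100 * SequenceMatcher(None, a, b).ratio())


def replace_close_matches(df, column, string_to_match, min_ratio=90):
    # unique strings in the column
    strings = df[column].unique()
    # keep the ones that are similar enough to the target string
    close = [s for s in strings if similarity(s, string_to_match) >= min_ratio]
    # overwrite every close match with the target string (modifies df in place)
    df.loc[df[column].isin(close), column] = string_to_match


# Toy data (made up)
df = pd.DataFrame({"country": ["south korea", "southkorea", "norway", "south korea"]})
replace_close_matches(df, "country", "south korea")
print(df["country"].unique())
```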