# DataCamp - Preprocessing Data for Machine Learning

Data preprocessing
- Goes beyond cleaning and exploratory data analysis
- Preps data for modeling
- Most Python ML libraries, such as scikit-learn, require numerical input

Understand data with:
- df.head()
- df.columns (an attribute, so no parentheses)
- df.dtypes (also an attribute)
- df.describe()
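
As a minimal sketch of the inspection calls listed above, the snippet below runs them on a small made-up DataFrame. The column names and values are hypothetical and only serve to show which calls are methods and which are attributes.

```python
import pandas as pd

# Hypothetical toy data, not taken from the course dataset
df = pd.DataFrame({
    "title": ["a", "b", "c"],
    "hits": [10, 25, 7],
    "score": [0.5, None, 1.5],
})

print(df.head())      # first rows (method, needs parentheses)
print(df.columns)     # column labels (attribute, no parentheses)
print(df.dtypes)      # dtype of each column (attribute)
print(df.describe())  # summary statistics for the numeric columns
```

Calling `df.columns()` or `df.dtypes()` raises a `TypeError`, because both are attributes rather than methods.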

First step in preprocessing: remove missing data
- df.dropna() to drop all rows with a missing value
- df.drop([1,2,3]) to drop specific rows by index label
- df.drop('col label', axis=1) to drop columns
- df.dropna(axis=1, thresh=n) to drop columns with fewer than n non-null values (thresh is the minimum number of non-null values required to keep the column); axis 0 = row, axis 1 = col
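
The `thresh` argument is easy to misread, so here is a small sketch on made-up data (the frame and the variable names are hypothetical). It shows that `thresh` counts non-null values, and it exercises the other drop calls from the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" has 2 non-null values, column "c" has 1
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, np.nan, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

rows_complete = df.dropna()              # keeps only rows with no missing value
cols_kept = df.dropna(axis=1, thresh=2)  # keeps columns with at least 2 non-null values
without_rows = df.drop([1, 2])           # drops the rows labelled 1 and 2
without_col = df.drop("c", axis=1)       # drops column "c"

print(rows_complete.index.tolist())
print(cols_kept.columns.tolist())
```

None of these calls modify `df`; each returns a new DataFrame that has to be assigned if it should be kept.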

Filter dataframe based on values
- df[df['col'] == x]

Get a count of null values in a column, then create a df without the rows where that column is null
- df['col'].isnull().sum()
- df[df['col'].notnull()]
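
A short sketch of boolean filtering and null handling on a made-up frame (the data and variable names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one missing value in "col"
df = pd.DataFrame({
    "col": ["x", None, "y", "x"],
    "n": [1, 2, 3, 4],
})

only_x = df[df["col"] == "x"]        # rows where col equals "x"
n_missing = df["col"].isnull().sum() # number of null values in col
non_null = df[df["col"].notnull()]   # rows where col is not null

print(n_missing)
print(non_null.shape)
```

The expression inside the brackets is a boolean Series, and indexing with it keeps only the rows where the value is `True`.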


In [None]:
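# Keep only columns with at least 3 non-null values (thresh is the minimum non-null count, not the number of nulls allowed).
# The result is not assigned here, so volunteer itself is unchanged.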
volunteer.dropna(axis=1, thresh=3)

# Check how many values are missing in the category_desc column
print(volunteer['category_desc'].isnull().sum())

# Subset the volunteer dataset to the rows where category_desc is not null
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]

# Print out the shape of the subset
print(volunteer_subset.shape)

#### Working with Data Types

df.dtypes
- most commonly used types in pandas:
- object - string/mixed
- int64 - integer
- float64 - float
- datetime64 (or timedelta64 for durations) - datetime

converting column types
- df['col'] = df['col'].astype("float")
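
As an illustration of the conversion above, this sketch converts a column of numeric strings to floats. The data is made up for the example.

```python
import pandas as pd

# Hypothetical column of numbers stored as strings
df = pd.DataFrame({"col": ["1.5", "2", "3.25"]})
print(df["col"].dtype)  # a string/object dtype before conversion

# astype returns a new Series, so assign it back to the column
df["col"] = df["col"].astype("float")
print(df["col"].dtype)
print(df["col"].sum())
```

If a value cannot be parsed as a number, `astype` raises a `ValueError`, so it is worth checking the column's contents first.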

In [None]:
# Print the head of the hits column
print(volunteer["hits"].head())

# Convert the hits column to type int
volunteer["hits"] = volunteer['hits'].astype('int')

# Look at the dtypes of the dataset
print(volunteer.dtypes)

#### Training and Test Sets

train_test_split splits the data into 75% training and 25% test sets by default
- from sklearn.model_selection import train_test_split
- x_train, x_test, y_train, y_test = train_test_split(x, y)

If you have imbalanced classes, use stratified sampling, which takes the distribution of classes in the dataset into account
ex: if the dataset contains 100 samples, 80 class 1 and 20 class 2
    - we want the training set to contain 75 samples: 60 class 1, 15 class 2
    - and the test set to contain 25 samples: 20 class 1, 5 class 2
    
You can use the stratify parameter in train_test_split:
- stratify=y
- check the result with value_counts() on y_train and y_test
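
The 100-sample example above can be reproduced with a small sketch. The feature column, the label name, and `random_state=0` are assumptions made for this illustration; the 80/20 class split is the one from the notes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 80 of class 1 and 20 of class 2
X = pd.DataFrame({"feature": range(100)})
y = pd.Series([1] * 80 + [2] * 20, name="label")

# Default test size is 25%; stratify=y preserves the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

print(y_train.value_counts())  # class counts in the 75 training samples
print(y_test.value_counts())   # class counts in the 25 test samples
```

Without `stratify=y` the class counts in each split vary with the random shuffle, which matters most when one class is rare.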
 
Code example below - We know that the distribution of classes in the category_desc column of the volunteer dataset is uneven. If we wanted to train a model to predict category_desc, we would want to train it on a sample that is representative of the entire dataset. Stratified sampling is a way to achieve this.

In [None]:
from sklearn.model_selection import train_test_split

# Create a DataFrame with all columns except category_desc
volunteer_X = volunteer.drop("category_desc", axis=1)

# Create a category_desc labels dataset
volunteer_y = volunteer[["category_desc"]]

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)

# Print out the category_desc counts on the training y labels
print(y_train["category_desc"].value_counts())