# Intro to Data Manipulation and Pandas
**Dan Tamayo**

*Material draws from Dave Novelli's [blog](http://www.ultravioletanalytics.com/2014/10/30/kaggle-titanic-competition-part-i-intro/) series working through the Titanic dataset, and from the [Pandas Cookbook](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) by Julia Evans*


# Follow Along

In a terminal, navigate to your local copy of the MachineLearning folder, pull the latest changes, and launch the notebook:

    cd /path/to/MachineLearning
    git pull
    cd Day2
    source activate ml
    jupyter notebook TitanicPandas.ipynb

# Loading a Dataset

Wrong way:

In [None]:
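import numpy as np
# NaN never compares equal to anything, not even itself,
# so this mask is True for every row and nothing is filtered out
df[df['Cabin']!=np.nan]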

Right Way:

In [None]:
df.loc[df['Cabin'].notnull()]
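
The difference between the two masks is easier to see on a small example. The sketch below uses a made-up Series standing in for `df['Cabin']`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy column standing in for df['Cabin'] (values are made up for illustration)
cabin = pd.Series(['C85', np.nan, 'E46', np.nan])

# NaN compares unequal to everything, including itself, so this mask is all True
wrong_mask = cabin != np.nan
# notnull() checks for missing values explicitly
right_mask = cabin.notnull()

print(wrong_mask.tolist())
print(right_mask.tolist())
```

Only the second mask separates the missing entries from the valid ones.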

# Approach 1:  Throw out any rows with missing data values

In [None]:
df.notnull().head()

In [None]:
df_filter = df.loc[df.notnull().all(axis=1)]

# How much data is left?

In [None]:
df_filter.shape

# Approach 2:  Assign missing identifier

In [None]:
df.loc[df['Cabin'].isnull(), 'Cabin'] = 'U0'
df.head()

# Approach 3: Assign the average/median/mode value
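
As a minimal sketch of this approach, the example below fills gaps in a made-up DataFrame whose column names mirror the Titanic data. Which statistic to use for which column is a judgment call; the choices here are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy data (made up); column names mirror the Titanic dataset
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0, 38.0],
    'Fare': [7.25, 8.05, np.nan, 71.0],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# mean(), median() and mode() all skip missing values by default
toy.loc[toy['Age'].isnull(), 'Age'] = toy['Age'].mean()
toy.loc[toy['Fare'].isnull(), 'Fare'] = toy['Fare'].median()
# mode() returns a Series (there can be ties), so take its first entry
toy.loc[toy['Embarked'].isnull(), 'Embarked'] = toy['Embarked'].mode()[0]

print(toy)
```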

# Plotting

In [None]:
%matplotlib inline
import seaborn
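ax = df['Pclass'].hist()  # Series.hist() returns a matplotlib Axes, not a Figure

In [None]:
axes = df.hist(figsize=(15,5))  # DataFrame.hist() returns an array of Axes, one per numeric column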

# Filtering

In [None]:
df.head()

# Boolean Masks

In [None]:
mask = df['Embarked'] == 'C'
mask.head()

# Filtering Dataframes

In [None]:
df_filter = df[mask] # df_filter = df[df['Embarked'] == 'C']
df_filter.head()

# Memory Issues

In [None]:
df_filter = df.loc[df['Embarked'] == 'C']
df_filter.head()

# Selecting specific columns

df.loc[criterion, columns]
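
Here `criterion` is a boolean mask over the rows and `columns` is a list of column labels. A small sketch on made-up data (the names `toy`, `criterion`, `columns` and `subset` are just placeholders):

```python
import pandas as pd

# Toy data (made up); column names mirror the Titanic dataset
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 54.0, 35.0],
    'Fare': [7.25, 71.28, 51.86, 8.05],
    'Sex': ['male', 'female', 'male', 'male'],
})

criterion = toy['Age'] > 30.0   # boolean mask: which rows to keep
columns = ['Fare', 'Sex']       # list of labels: which columns to keep

subset = toy.loc[criterion, columns]
print(subset)
```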

# Approach 1:  Creating Dummy (Binary) Variables

In [None]:
print(df['Embarked'].unique())

In [None]:
dummies_df = pd.get_dummies(df['Embarked'])
dummies_df.head()

In [None]:
def addEmbarked(name):
    return 'Embarked_' + name
dummies_df = dummies_df.rename(columns=addEmbarked)
dummies_df.head()

In [None]:
df = pd.concat([df, dummies_df], axis=1)
df.head()

# Approach 2: Factorizing (Make single multi-class feature)

In [None]:
# factorize() returns (codes, uniques); missing values get the code -1
df['EmbarkedNum'] = pd.factorize(df['Embarked'])[0]
df.head(6)

# What are the classes?

In [None]:
pd.factorize(df['Sex'])

In [None]:
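# Codes follow the order of first appearance, so check the uniques printed above:
# this column is 1 for women only if 'male' is the first value in df['Sex']
df['Female'] = pd.factorize(df['Sex'])[0]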
df.head()

# Continuous Features: Feature Scaling

![Training data](images/04_knn_dataset.png)

# Requirements for working with data in scikit-learn

1. Features should not have missing values
2. Features and response should be numeric
3. Features and response should be NumPy arrays
4. Features and response are separate objects
5. Features and response should have specific shapes

# How to Find Valid Values

# Indexing

In [None]:
submit_df = pd.read_csv('data/test.csv', index_col=0)
submit_df.head()

In [None]:
input_df = pd.read_csv('data/train.csv', index_col=0)
submit_df = pd.read_csv('data/test.csv', index_col=0)
df = pd.concat([input_df, submit_df])
df.tail()

In [None]:
print(df.shape[1], "columns")
print(df.shape[0], "rows")
print(df.columns.values)

# Putting it all together

In [None]:
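def process_data(df):
    """Return a copy of df with missing values filled and categorical features encoded."""
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified

    # Encode Sex as integers (codes follow the order of first appearance)
    df['Female'] = pd.factorize(df['Sex'])[0]

    # Fill missing values: mean age, median fare, placeholder cabin, most common port
    df.loc[df['Age'].isnull(), 'Age'] = df['Age'].mean()
    df.loc[df['Fare'].isnull(), 'Fare'] = df['Fare'].median()
    df.loc[df['Cabin'].isnull(), 'Cabin'] = 'U0'
    df.loc[df['Embarked'].isnull(), 'Embarked'] = df['Embarked'].dropna().mode()[0]

    # Encode Embarked twice: as dummy columns (Embarked_C, ...) and as one integer column
    dummies_df = pd.get_dummies(df['Embarked'])
    def addEmbarked(name):
        return 'Embarked_' + name
    dummies_df = dummies_df.rename(columns=addEmbarked)
    df = pd.concat([df, dummies_df], axis=1)
    df['EmbarkedNum'] = pd.factorize(df['Embarked'])[0]

    return df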

In [None]:
df = process_data(df)
df.tail()

In [None]:
features = ['Age', 'Fare', 'Parch', 'Pclass', 'SibSp', 'Female', 'EmbarkedNum']
df_test = df.loc[df['Survived'].isnull(), features]
df_train = df.loc[df['Survived'].notnull(), features+['Survived']]
df_train.head()

# Requirements for working with data in scikit-learn

1. Features should not have missing values
2. Features and response should be numeric
3. Features and response should be NumPy arrays
4. **Features and response are separate objects**
5. **Features and response should have specific shapes**

In [None]:
X_train = df_train[features].values
y_train = df_train['Survived'].values
print(X_train[0:5])
print(y_train[0:5])
print("X has {0} rows".format(X_train.shape[0]))
print("y has {0} rows".format(y_train.shape[0]))

# Feature Engineering

# Parsing Alphanumeric Features

In [None]:
test = df.loc[df['Age'] > 30., ['Age', 'Fare', 'Sex']]
test.head()

# Combining criteria / columns

In [None]:
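# Combine conditions with & (each in parentheses); 'Age':'Fare' is a label slice that includes both endpoints
test = df.loc[(df['Age'] > 30.) & (df['Fare'] < 50.), 'Age':'Fare']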
test.head()

In [None]:
import pandas as pd
input_df = pd.read_csv('data/train.csv')   # labeled training data
submit_df = pd.read_csv('data/test.csv')   # unlabeled data to predict on

print(input_df.shape)
print(submit_df.shape)

In [None]:
submit_df.head()

In [None]:
input_df.head()

In [None]:
df[['Fare', 'Sex']].head()

# Data at a glance

In [None]:
df['Sex'].value_counts()

In [None]:
df['Age'].median()

In [None]:
df.loc[df['Fare'].isnull()].shape

In [None]:
df['Fare'].median()

In [None]:
df.loc[df['Fare'].isnull(), 'Fare'] = df['Fare'].median()

# Categorical Variables

In [None]:
df.loc[df['Embarked'].isnull()].shape

In [None]:
df['Embarked'].mode()

In [None]:
df['Embarked'].mode()[0]

In [None]:
df.loc[df['Embarked'].isnull(), 'Embarked'] = df['Embarked'].dropna().mode()[0]

# Approach 4: Fit a regression model to predict missing values
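
A minimal sketch of the idea, on made-up data: fit a model on the rows where the value is known, then predict it for the rows where it is missing. The example assumes a single predictor (`Fare`) and uses `np.polyfit` as a stand-in for the regression model; any regressor, for instance one from scikit-learn, could take its place.

```python
import numpy as np
import pandas as pd

# Toy data (made up): Age is missing for two passengers
toy = pd.DataFrame({
    'Fare': [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    'Age': [20.0, 25.0, np.nan, 35.0, np.nan, 45.0],
})

known = toy['Age'].notnull()

# Fit Age = slope * Fare + intercept on the rows where Age is known
slope, intercept = np.polyfit(
    toy.loc[known, 'Fare'].values, toy.loc[known, 'Age'].values, deg=1
)

# Predict Age for the rows where it is missing
toy.loc[~known, 'Age'] = slope * toy.loc[~known, 'Fare'] + intercept

print(toy)
```

On real data the relationship will not be this clean, so the predicted values carry the model's error with them.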

In [None]:
from sklearn.preprocessing import StandardScaler
# To scale every numeric column, fit on df.select_dtypes('number'); fitting on df itself fails on string columns
scaler = StandardScaler().fit(df[['Age', 'Fare']])
print("Means = {0}".format(scaler.mean_))
print("Stdevs = {0}".format(scaler.scale_))
df[['Age', 'Fare']] = scaler.transform(df[['Age', 'Fare']])
df.head()

# Requirements for working with data in scikit-learn

1. Features should not have missing values
2. Features and response should be numeric
3. **Features and response should be NumPy arrays**
4. Features and response are separate objects
5. Features and response should have specific shapes

# Underneath, Pandas series are numpy arrays

In [None]:
df['Cabin'].unique()

![TitanicPlans](images/titanicplans.jpg)

# Regular Expressions

In [None]:
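import re
def getDeck(cabin):
    # The first capital letter in the cabin string is the deck, e.g. 'C237' -> 'C'
    match = re.search(r"([A-Z])", cabin)
    return match.group(1) if match is not None else None
def getCabinNum(cabin):
    # The first run of digits is the cabin number, returned as a string, e.g. 'C237' -> '237'
    match = re.search(r"([0-9]+)", cabin)
    return match.group(1) if match is not None else None
print(getDeck('C237'))
print(getCabinNum('C237'))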

![regex](images/regular_expressions_cheat_sheet.png)

# Apply a function to all rows to generate a new feature

In [None]:
for col in df.columns:
    # isnull() gives a boolean Series; summing it counts the True entries
    print("NaNs in column {0} = {1}".format(col, df[col].isnull().sum()))

In [None]:
df.loc[df['Age'].isnull(), 'Age'] = df['Age'].mean()

# Requirements for working with data in scikit-learn

1. Features should not have missing values
2. **Features and response should be numeric**
3. Features and response should be NumPy arrays
4. Features and response are separate objects
5. Features and response should have specific shapes

In [None]:
fares = df['Fare'].values
type(fares)

# Getting a pipeline ready for sklearn

# Combining Data Frames

In [None]:
df['Deck'] = df['Cabin'].map(getDeck)
df['CabinNum'] = df['Cabin'].map(getCabinNum)
df.head()

In [None]:
df['CabinNum'].isnull().value_counts()

In [None]:
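df.loc[df['CabinNum'].isnull(), 'CabinNum'] = 0
# getCabinNum returns strings, so convert the column to integers to get a numeric feature
df['CabinNum'] = df['CabinNum'].astype(int)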

In [None]:
df['Deck'].isnull().value_counts()

In [None]:
df['DeckNum'] = pd.factorize(df['Deck'])[0]

# What to do with the name?

# Number of names
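
One possible reading of this feature is the number of words in each passenger's name. The sketch below assumes that interpretation and counts whitespace-separated words, so the surname and title are included in the count; the Series `names` is a stand-in for `df['Name']`.

```python
import pandas as pd

# Two names in the format used by the Titanic dataset
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
])

# Count whitespace-separated words in each name
num_names = names.map(lambda name: len(name.split()))

print(num_names.tolist())
```

On the full dataset the same pattern would read `df['Name'].map(...)`, mirroring how `getDeck` is applied to `df['Cabin']`.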