# <center> DATA SCIENCE WITH SOCIAL SCIENCE DATA <br/><br/> CSCAR WORKSHOP <br/><br/> 04/06/2017
## <center> Marcio Mourao and Jeff Lockhart


# <center> Setup for Anaconda / Jupyter Notebook

<ul>
    <li>Go to the page https://marcio-mourao.github.io/</li>
    <li>Download the materials under "Social Data Science - Part IV" to your "username/Documents"</li><br/>
    
    <li>Click the Windows button (Bottom Left Corner)</li>
    <li>Click "All apps"</li>
    <li>Click "Anaconda3 (64-bit)"</li>
    <li>Click "Anaconda Prompt" </li>
    <ul>
        <li>Enter "conda update pandas"</li>
        <li>Enter "conda update matplotlib"</li>
        <li>Enter "conda update scikit-learn"</li>
    </ul><br/>
    
    <li>Click the Windows button (Bottom Left Corner)</li>
    <li>Click "All apps"</li>
    <li>Click "Anaconda3 (64-bit)"</li>
    <li>Click "Jupyter Notebook" </li><br/>
    
    <li>Click "Workshop.ipynb" (this should open a new tab in the browser)</li>
</ul>

# <center> Introduction

<ul>
  <li>Please sign the sign-up sheet!</li>
  <li>Visit http://cscar.research.umich.edu/ to see what else CSCAR offers.</li>
</ul>

# <center> Summary of this workshop

<ul>
  <li>Review of simple and structured Python data types</li>
  <li>Creating dataframes, describing and looking for 'missing values' in data</li>
  <li>Indexing and slicing</li>
  <li>Apply functions, groupby and sort data</li>
  <li>Visualization</li>
  <li>Logistic Regression</li>
  <li>Random Forests</li>
</ul>



# <center> References

<ul>
  <li>https://www.continuum.io/anaconda-overview</li>
  <li>http://www.numpy.org/</li>
  <li>http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html</li>
  <li>http://matplotlib.org/</li>
  <li>http://pandas.pydata.org/pandas-docs/stable/10min.html</li>
  <li>http://scikit-learn.org/stable/</li>
  <li>http://statsmodels.sourceforge.net/</li>
</ul>

In [None]:
#Check Python version
import sys
print(sys.version)

## Import relevant modules

In [None]:
%matplotlib inline
import numpy as np
#from numpy import * #another way of importing but I prefer the above
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
#Check numpy and pandas version
print(np.__version__)
print(pd.__version__)

## Python Simple Data Types
##### Integers
##### Floats
##### Strings
##### Booleans
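
The four simple types listed above can be inspected with the built-in `type()` function. The cell below is a minimal sketch; the variable names are illustrative only and are not used elsewhere in the workshop.

```python
# One value of each simple type (names are made up for this example)
an_integer = 42
a_float = 3.5
a_string = 'census'
a_boolean = an_integer > 10   # a comparison evaluates to a bool

# Print each value together with the name of its type
for value in (an_integer, a_float, a_string, a_boolean):
    print(repr(value), type(value).__name__)

# Arithmetic that mixes an int and a float produces a float
mixed = an_integer + a_float
print(mixed, type(mixed).__name__)
```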

## Python Data Structures

### Lists

In [None]:
example_list = [2,4,'fg',8,[3,4]]
print(example_list)
print(example_list[0])
print(example_list[2:4])
print(example_list[-2])
example_list[2]=20
print(example_list)
print(example_list[4][0])

### Tuples

In [None]:
example_tuple = (2,4,6,8,10)
print(example_tuple)
print(example_tuple[1])
#example_tuple[2]=20 #this should produce an error

### Dictionary

In [None]:
example_dictionary = {'A':20,'B':40,'C':60}
print(example_dictionary)
print(example_dictionary['B'])
example_dictionary['C']=100
print(example_dictionary)
print(example_dictionary.keys())
print(example_dictionary.values())

### Numpy arrays

In [None]:
example_array = np.array([2,4,'dfg',8,10])
print(example_array)
print(example_array[0])
print(example_array[2:4])
print(example_array[-2])
example_array[2]=20
print(example_array)
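
Unlike a list, a numpy array holds a single data type. The sketch below (variable names are illustrative) shows what that means for the mixed array above: the numbers are converted to strings, and an integer assigned later is stored as a string as well. An all-numeric array keeps a numeric dtype and supports arithmetic.

```python
import numpy as np

# Mixing numbers and text forces every element to a string dtype
mixed_array = np.array([2, 4, 'dfg', 8, 10])
print(mixed_array.dtype)        # a Unicode string dtype (kind 'U')

# Assigning an integer does not change the dtype; the value is stored as text
mixed_array[2] = 20
print(mixed_array[2] == '20')   # compares equal to the string '20'

# An all-integer array keeps an integer dtype, so element-wise math works
numeric_array = np.array([2, 4, 6, 8, 10])
print(numeric_array.dtype)
print(numeric_array * 2)
```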

### Pandas Series, a one dimensional labeled array

In [None]:
example_dictionary = {'A':20,'B':40,'C':60}
example_series = pd.Series(example_dictionary)
print(example_series)
print(example_series.iloc[0]) #position-based access; [0] on a labeled Series is deprecated in recent pandas
print(example_series['B':])

### Pandas Dataframes, a two-dimensional labeled data structure with columns of potentially different types

In [None]:
d=[['df',1.0],
   ['as',3],
   ['bq',5]]
example_dataframe = pd.DataFrame(d,index=['Row1','Row2','Row3'],columns=['Column1','Column2'])
print(example_dataframe)
example_dataframe.dtypes

# <center> Description of the data

This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.

Attributes:

income (target): >50K, <=50K

age: continuous

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked

fnlwgt: continuous

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool

education-num: continuous

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black

sex: Female, Male

capital-gain: continuous

capital-loss: continuous

hours-per-week: continuous

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands

## Create a dataframe, describe and look for missing data

In [None]:
#Displays the signature of the function
?pd.read_csv

In [None]:
#Creates a dataframe named "adults" from reading the file "adult.csv"
adults = pd.read_csv('adult.csv',na_values=['?'])
adults.head()

In [None]:
#Displays number of lines and number of columns of the dataframe
adults.shape

In [None]:
#Convert object dtypes to category dtypes
adults[adults.select_dtypes(['object']).columns] = adults.select_dtypes(['object']).apply(lambda x: x.astype('category'))
adults.head()

In [None]:
#Displays the data types associated with each dataframe column
adults.dtypes

In [None]:
#Display the categories of the 'workclass' categorical column
adults.workclass.cat.categories

In [None]:
#Describes the dataframe
adults.describe()

In [None]:
#Describes everything in the dataframe
adults.describe(include='all')

In [None]:
#Displays whether columns contain any null values
adults.isnull().any(axis=0)

In [None]:
#Count the number of missing values in each column of the dataframe
adults.apply(lambda x: sum(x.isnull()),axis=0)

In [None]:
#Count the number of missing values in each column of the dataframe and sums them up
adults.apply(lambda x: sum(x.isnull()),axis=0).sum()

In [None]:
#Count number of lines with NaNs
adults.apply(lambda x: x.isnull().any(),axis=1).sum()

In [None]:
#Fraction of observations with NaNs (potentially for removal), computed rather than typed in by hand
adults.isnull().any(axis=1).sum()/adults.shape[0]

In [None]:
#Removes any lines from the dataframe that contain NaNs
adults=adults.dropna(axis=0,how='any')
adults.head()

In [None]:
#Displays number of lines and number of columns of the dataframe
adults.shape

In [None]:
#Just confirming the removal was successful
30162+2399
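
The missing-data steps above can be checked on a frame small enough to verify by eye. This is a sketch on made-up data (the `toy` frame and its values are not part of `adult.csv`); it uses `isnull().sum()`, which gives the same per-column counts as the `apply` version above.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing values in two different rows
toy = pd.DataFrame({'age': [25, 40, np.nan, 31],
                    'workclass': ['Private', None, 'State-gov', 'Private'],
                    'hours': [40, 50, 38, 45]})

# Missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Number and fraction of rows that contain at least one missing value
rows_with_missing = toy.isnull().any(axis=1).sum()
fraction_with_missing = rows_with_missing / toy.shape[0]
print(rows_with_missing, fraction_with_missing)

# Drop those rows and confirm that the counts add up
cleaned = toy.dropna(axis=0, how='any')
print(cleaned.shape[0] + rows_with_missing == toy.shape[0])
```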

## Indexing and Slicing

In [None]:
#Displays the first rows of the dataframe
adults[:5]

In [None]:
#Displays the first rows of the dataframe
adults.head()

In [None]:
#Displays the last rows of the dataframe
adults[::-1].head()

In [None]:
#Same as above, but returns a numpy array of values
adults[::-1].head().values

In [None]:
#From the dataframe, retrieve rows in positions 2 and 3 (integer-based index) and the column in position 0
adults.iloc[2:4,0]

In [None]:
#From the dataframe, retrieve rows with labels 2, 3 and 4 and column 'education'
adults.loc[2:4,'education']

In [None]:
#From the dataframe, retrieve rows with labels 2, 3 and 4 and columns 'age' and 'education'
adults.loc[2:4,['age','education']]
#Note: .ix was removed in pandas 1.0; use .loc for labels and .iloc for positions

In [None]:
#Retrieves a boolean series with True values wherever age is lower than 50
adults['age']<50

In [None]:
#Returns the subset of the dataframe where age is lower than 50 (the original dataframe is unchanged)
adults[adults['age']<50].head()
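
Once rows have been dropped, row labels and row positions no longer coincide, which is why `.loc` and `.iloc` can return different rows for the same slice. The sketch below uses a made-up five-row frame (`toy` is not the workshop data) to show the difference, including the fact that `.loc` slices include the end label while `.iloc` slices exclude the end position.

```python
import pandas as pd

# Made-up data; dropping label 1 leaves labels 0, 2, 3, 4
toy = pd.DataFrame({'age': [39, 50, 38, 53, 28],
                    'education': ['Bachelors', 'Bachelors', 'HS-grad', '11th', 'Masters']})
toy = toy.drop(index=[1])

# Position-based: positions 2 and 3 (end excluded), first column
by_position = toy.iloc[2:4, 0]
print(by_position)

# Label-based: labels 2 through 4 (end included), column 'age'
by_label = toy.loc[2:4, 'age']
print(by_label)

# Boolean mask: keep only the rows where age is lower than 50
under_50 = toy[toy['age'] < 50]
print(under_50)
```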

## Apply functions, Groupby and Sort data

In [None]:
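#Returns the mean of each one of the numerical columns on the dataframe
#(numeric_only=True skips the categorical columns, which recent pandas would otherwise reject)
adults.mean(axis=0,numeric_only=True)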

In [None]:
#Returns the mean of hours per week
adults['hours.per.week'].mean()

In [None]:
#Returns the mean and the std for each of the selected columns (age and # education years)
adults[['age','education.num']].apply(lambda x: ([x.mean(),x.std()]),axis=0)

In [None]:
#Groups the data by 'income' and counts the non-missing values in each column for each group
adults.groupby('income').count()

In [None]:
#Groups the data by 'native.country' and obtains only the average 'hours.per.week'
adults[['native.country','hours.per.week']].groupby('native.country').mean()

In [None]:
#Groups the data by 'native.country' and obtains the max 'hours.per.week'
adults[['native.country','hours.per.week']].groupby('native.country').max()

In [None]:
#Groups the data by 'income' and 'marital.status', and obtains the mean of each numerical column for each group
adults.groupby(['income','marital.status']).mean(numeric_only=True)

In [None]:
#Groups specified data by 'sex' and 'income', and obtains a mean for the # education years
adults[['income','sex','education.num']].groupby(['sex','income']).mean()

In [None]:
#Sorts the data by age and # education years in a specified order
adults.sort_values(by=['age','education.num'],ascending=[True,False]).head(10)
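
Several summaries per group can also be computed in one call with `agg`, and the result sorted like any other dataframe. This is a sketch on made-up data (`toy` and its values are illustrative, although the column names mirror the workshop dataset).

```python
import pandas as pd

# Made-up data with the same column names as the workshop dataset
toy = pd.DataFrame({'income': ['<=50K', '>50K', '<=50K', '>50K', '<=50K'],
                    'hours.per.week': [40, 60, 20, 50, 30]})

# Count, mean and max of hours per week within each income group
summary = toy.groupby('income')['hours.per.week'].agg(['count', 'mean', 'max'])
print(summary)

# Sort the groups by their mean, largest first
ranked = summary.sort_values(by='mean', ascending=False)
print(ranked)
```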

## Visualization

In [None]:
#Creates a histogram of 'age' and 'education.num' and 'hours.per.week' on the dataframe
adults.hist(column=['age','education.num','hours.per.week'],grid=False)

In [None]:
#Creates a histogram for 'age' grouped by 'income'
adults.hist(column='age',by='income')

In [None]:
#Creates a separate histogram figure of 'age' for each 'income' group
adults.groupby('income').hist(column='age',grid=False)

In [None]:
#Displays the histograms of 'age' grouped by 'income' in the same plot
plt.rcParams.update({'font.size': 20})
plt.figure()
adults.groupby('income').age.hist(alpha=0.5)
plt.legend(labels=['<=50K','>50K'],loc='best')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(False)

## Machine Learning

In [None]:
#Import modules
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics

In [None]:
#Creates a label encoder object
le=LabelEncoder()

adults2=adults.copy()

#Defines an encoder function that is applicable to all columns
def my_encoder(col):
    adults2[col.name + '_enc']=le.fit(col.values).transform(col.values)

#Apply encoder to all category columns
adults2[adults.select_dtypes(include=['category']).columns.values].apply(my_encoder,axis=0)
adults2.head()

In [None]:
#Check new data types
adults2.dtypes
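
To see what the encoder does to a single column, the sketch below fits a `LabelEncoder` on a few made-up values (the list is illustrative, not taken from the data). The classes are sorted alphabetically and each value is replaced by the position of its class. Note that these integer codes imply an ordering that nominal categories do not have; one-hot encoding (for example with `pd.get_dummies`) is a common alternative, not used in this workshop.

```python
from sklearn.preprocessing import LabelEncoder

# A few made-up values from a categorical column
values = ['Private', 'State-gov', 'Private', 'Local-gov']

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# The learned classes, in sorted order; a code is an index into this list
print(list(encoder.classes_))
print(codes)

# The mapping can be reversed to recover the original labels
recovered = encoder.inverse_transform(codes)
print(list(recovered))
```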

In [None]:
#Define covariates in X and dependent variable in y
X = adults2[['age','workclass_enc','education.num','marital.status_enc','occupation_enc',
            'race_enc','sex_enc','relationship_enc','capital.gain','capital.loss',
            'hours.per.week','native.country_enc']]
y = adults2.income_enc

In [None]:
#Obtain the data for the fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

print('Total number of records: ', adults2.shape[0])
print('Type of X_train: ', type(X_train))
print('Number of records in X_train: ', len(X_train))
print('Fraction on X_train: ', len(X_train)/adults2.shape[0])
print('Number of records in y_train: ', len(y_train))
print('Type of y_train: ', type(y_train), '\n')

print('Type of X_test: ', type(X_test))
print('Number of records in X_test: ', len(X_test))
print('Fraction on X_test: ', len(X_test)/adults2.shape[0])
print('Number of records in y_test: ', len(y_test))
print('Type of y_test: ', type(y_test))

### Logistic Regression

In [None]:
#Creates a logistic regression object
logreg_model = LogisticRegression()

#Fit to the data
logreg_model.fit(X_train, y_train)

print('Intercept: \n', logreg_model.intercept_)
print('Coefficients: \n', logreg_model.coef_)

In [None]:
#Obtain probability predictions
y_pred_logreg_prob = logreg_model.predict_proba(X_test)
print('Predicted probabilities: \n', y_pred_logreg_prob)

#Obtain class predictions
y_pred_logreg_class = logreg_model.predict(X_test)
print('Predicted classes: \n', y_pred_logreg_class)

In [None]:
#Obtains accuracy score
print('LogReg score: ', metrics.accuracy_score(y_test, y_pred_logreg_class))

In [None]:
#Obtains confusion matrix
LogReg_CM=metrics.confusion_matrix(y_test,y_pred_logreg_class)
LogReg_CM

In [None]:
#Confirming accuracy
print('Number of elements in the test set: ', len(y_test))
print('Accuracy: ', (LogReg_CM[0,0]+LogReg_CM[1,1])/len(y_test))
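
The confusion matrix has true classes in rows and predicted classes in columns, so the diagonal holds the correct predictions. The sketch below uses a small made-up set of labels (not the workshop predictions) to show how accuracy, and also precision and recall for class 1, can be read off the matrix.

```python
import numpy as np
from sklearn import metrics

# Made-up true and predicted labels for eight observations
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

# Unpack the four cells of the 2x2 matrix
tn, fp = cm[0, 0], cm[0, 1]
fn, tp = cm[1, 0], cm[1, 1]

accuracy = (tn + tp) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted 1s that are truly 1
recall = tp / (tp + fn)           # share of true 1s that were predicted as 1
print(accuracy, precision, recall)

# The hand-computed accuracy agrees with scikit-learn's accuracy_score
print(np.isclose(accuracy, metrics.accuracy_score(y_true, y_pred)))
```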

In [None]:
#KFolds and Cross_val_scores
kf = KFold(n_splits=10, shuffle=True)
print('Cross validation score: ', cross_val_score(logreg_model, X, y, cv=kf).mean())

### Random Forest

In [None]:
#Creates a RF classification model
RF_model = RandomForestClassifier(n_estimators=10, criterion='gini')

#Fit to the data
RF_model.fit(X_train, y_train)

In [None]:
#Obtain probability predictions
y_pred_RF_prob = RF_model.predict_proba(X_test)
print('Predicted probabilities: \n', y_pred_RF_prob)

#Obtain class predictions
y_pred_RF_class = RF_model.predict(X_test)
print('Predicted classes: \n', y_pred_RF_class)

In [None]:
#Obtains accuracy score
print('RF Score: ', metrics.accuracy_score(y_test, y_pred_RF_class))

In [None]:
#Obtains confusion matrix
RF_cm=metrics.confusion_matrix(y_test,y_pred_RF_class)
RF_cm

In [None]:
#Confirming accuracy
print('Number of elements in the test set: ', len(y_test))
print('Accuracy: ', (RF_cm[0,0]+RF_cm[1,1])/len(y_test))

In [None]:
#Capture feature importance from the RF model
feature_imp=RF_model.feature_importances_

#Create plot of feature importance
positions = np.arange(12)
plt.barh(positions, feature_imp, align='center')
plt.xlabel("Feature Importances")
plt.ylabel("Features")
plt.yticks(positions, ('Age','Working Class', 'Years Education', 'Marital Status', 'Occupation',
                       'Race', 'Sex', 'Relationship Status', 'Capital Gain', 'Capital Loss',
                       'Hours per Week','Native Country'))
plt.grid(True)


In [None]:
#KFolds and Cross_val_scores
kf = KFold(n_splits=10, shuffle=True)
print('Cross validation score: ', cross_val_score(RF_model, X, y, cv=kf).mean())