# Women in Data Science Datathon Project
The Women in Data Science Datathon is inspired by research to help disadvantaged women participate more fully in their local economies.  The dataset is created from a survey in India regarding access to financial tools and digital technology.  In this Kaggle competition, we will predict the gender of the survey respondents based on their answers.  We will experiment with different ways of cleaning the data and various classification models in order to get the highest accuracy score.  While this iPython notebook ends at this step, we will then use our best model to predict the test data and then submit the results to Kaggle to be scored online.

In [11]:
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import linear_model

## Cleaning Data
After making our imports, we will clean the data.  First, we will drop columns with all empty values.  Next, we need to separate the columns into categoric and numeric factors.  For the categoric factors, we will replace empty values with 0 to create a null group and then create dummy/indicator variables.  For the numeric factors, we will replace empty values and non-answers with the median value of the column.

In [12]:
df = pd.read_csv("C:/train.csv")
Y = df['is_female']
X = df.drop(['is_female','train_id'], axis = 1)
X = X.dropna(axis=1, how='all') # If all values are nans, drop col

categoricColumns = X.columns.tolist()
numericColumns = []

for i in categoricColumns:
    if ("MT17_" in i or "FF16_" in i or "MM5_" in i or "FF9" in i
         or "MM8_" in i or "MM9_" in i or "MM17_" in i or "MM32_" in i
         or "MM42_" in i or "MMP2_" in i or "MMP4_" in i or "IFI4_" in i
         or "IFI2_" in i or "IFI14_" in i or "IFI15_" in i or "IFI17_" in i
         or "FL3" in i or "FL8_" in i or "LN1A" in i or "LN1B" in i
         or "LN2_1" in i or "LN2_2" in i or "LN2_3" in i or "LN2_4" in i):
        numericColumns.append(i)
        categoricColumns.remove(i)

for i in numericColumns:
    median = X[X[i].between(1, 20, inclusive=True)][i].median()
    X[i] = X[i].fillna(median) 
    X[i].replace([96, 99], median)

for i in categoricColumns:
    X[i] = X[i].fillna(0) 
    X = pd.get_dummies(X, columns=[i])
    


## Classification Models
Now the data has been cleaned, we will split our data into an 80/20 training and testing set.  Next, we will run four classification models: Logistic Regression, Random Forest, Gradient Boosting, and AdaBoost with decision tree classifier.  Logistic Regression and Random Forest gives us a baseline for what accuracies we can expect.  However, we expect to have the best results with our ensemble models.

In [13]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size = 0.2, random_state = 0)

# Logistic Regression
LogReg = linear_model.LogisticRegression()
LogReg.fit(X_train, y_train)
y_pred = LogReg.predict(X_test)
print('Logistic Regression: '+str(accuracy_score(y_test, y_pred)))

# Random Forest
clf = RandomForestClassifier(n_jobs=2, random_state=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Random Forest: '+str(accuracy_score(y_test, y_pred)))

# Gradient Boosting
clf = GradientBoostingClassifier(n_estimators=5000,
                                 learning_rate=0.5,
                                 max_depth=None, 
                                 random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Gradient Boosting: '+str(accuracy_score(y_test, y_pred)))

# AdaBoost
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                         algorithm="SAMME",
                         n_estimators=1000,
                         learning_rate=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('AdaBoost: '+str(accuracy_score(y_test, y_pred)))

Logistic Regression: 0.761904761905
Random Forest: 0.833333333333
Gradient Boosting: 0.857142857143
AdaBoost: 0.904761904762


## Conclusion
Our results were close to what we expected with our ensemble classifiers outperforming our simple classifiers with AdaBoost being our best model.  Next, we used AdaBoost to produce the predictions that we submitted to Kaggle for the competition.  The preliminary results have us at 91.444% accuracy.  However, this score only considers 16% of the testing set and the final results will be released after the competition is over.  Somethings that may improve our model include cleaning the data differently, trying other models and tuning parameters in our models.