# MegaFon Accelerator

## Task 
Build a model that predicts which of three segments (0, 1, 2) each person belongs to.

The contest_train.csv training sample consists of the following columns:  

* ID - the person's id.
* TARGET - the segment the person belongs to.
* FEATURE_0 ... FEATURE_259 - the person's characteristics.
   
The test sample contest_test.csv consists of an ID column followed by FEATURE_0 ... FEATURE_259.  
Predictions are scored with the macro-averaged F1 score (`f1_score` with `average='macro'`).
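
Macro-averaging computes F1 separately for each class and takes the unweighted mean, so the rare segment 2 counts as much as the dominant segment 0. A minimal sketch of that computation on made-up labels follows; the helper name `macro_f1` and the toy data are illustrative and not part of the contest code.

```python
def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); set to 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (made up for illustration)
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2]
print(macro_f1(y_true, y_pred))
```

Because every class contributes equally, a model that ignores the small classes is penalized even if its overall accuracy looks good.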

### Tools 
The following libraries were used to build the model:
* numpy - a Python library for working with arrays.
* pandas - a high-level data representation and manipulation tool.
* matplotlib - a library for creating visualizations.
* seaborn - a library for statistical graphics in Python.
* scipy - a Python library for scientific and mathematical computing.
* statsmodels - a library for estimating many different statistical models.
* sklearn (scikit-learn) - a library for machine learning.
* imblearn (imbalanced-learn) - a toolbox for resampling imbalanced datasets.

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time
import seaborn as sns
import sklearn

In [10]:
import imblearn
from imblearn.over_sampling import SMOTE,RandomOverSampler
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler,NearMiss
from sklearn.utils import shuffle

In [11]:
from sklearn.model_selection import GridSearchCV,cross_val_score,train_test_split
from sklearn import feature_selection
from sklearn.linear_model import LassoCV,SGDClassifier,RidgeClassifier,Lasso,LogisticRegressionCV,Perceptron
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score,confusion_matrix
from sklearn.preprocessing import StandardScaler,RobustScaler,OneHotEncoder
from sklearn.decomposition import PCA

Helper that writes predictions to a CSV file. It relies on the global `ID` series (the test-set ids), which is defined later in the notebook.

In [6]:
def write_sumb(name,prediction):
    pd.DataFrame({'ID':ID,'Predicted':prediction}).to_csv(name,index=False)

Function for cleaning data.

In [5]:
def cleaning_na(data):
    '''
    Clean a DataFrame in place:
    - drop columns where more than 40% of the values are missing;
    - fill remaining NaNs with the median (numeric columns)
      or the most frequent value (categorical columns);
    - drop the ID column.
    '''
    na_percent = data.isna().mean()*100
    data.drop(columns=na_percent[na_percent>40].index,inplace=True)
    for column in data.columns[data.isna().any()]:
        # assign back instead of chained inplace fillna, which may act on a copy
        if pd.api.types.is_numeric_dtype(data[column]):
            data[column] = data[column].fillna(data[column].median())
        else:
            data[column] = data[column].fillna(data[column].mode()[0])
    data.drop('ID',axis=1,inplace=True)
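
To see what this kind of cleaning does, here is a small self-contained sketch on a made-up frame. The column names and the helper `impute_and_drop` are hypothetical. The helper follows the rules described in the docstring (drop sparse columns, then median or most-frequent imputation) and returns a copy rather than modifying its input, so it is not a line-by-line copy of `cleaning_na`.

```python
import numpy as np
import pandas as pd

def impute_and_drop(df, max_na_percent=40):
    """Return a cleaned copy: drop sparse columns, then fill remaining NaNs."""
    df = df.copy()
    na_percent = df.isna().mean() * 100
    df = df.drop(columns=na_percent[na_percent > max_na_percent].index)
    for col in df.columns[df.isna().any()]:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())   # numeric: median
        else:
            df[col] = df[col].fillna(df[col].mode()[0])  # categorical: most frequent
    return df

toy = pd.DataFrame({
    'num': [1.0, np.nan, 3.0, 5.0, 7.0],           # 20% missing, filled with the median
    'cat': ['a', 'b', None, 'a', 'a'],             # 20% missing, filled with the mode
    'sparse': [np.nan, np.nan, np.nan, 1.0, 2.0],  # 60% missing, dropped
})
clean = impute_and_drop(toy)
print(clean)
```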

Helper for cross-validating different algorithms.


In [4157]:
def evaluate_score(model,X,y):
    '''
    Return the mean cross-validated macro-F1 score of the model (cv = 3)
    and the total wall-clock time of cross-validation (training + prediction).
    '''
    model_instance = model
    time_before = time.time()
    cvs = cross_val_score(model_instance,X,y,cv=3,scoring='f1_macro').mean()
    time_after = time.time()
    result = pd.DataFrame({'name':model.__class__.__name__,'cross_val_score':cvs,
                           'time':time_after - time_before},index=[0])
    return result

Loading dataset.

In [12]:
train = pd.read_csv('../mf-accelerator/contest_train.csv')
test = pd.read_csv('../mf-accelerator/contest_test.csv')
sample = pd.read_csv('../mf-accelerator/sample_subm.csv')

In [14]:
train.TARGET.value_counts()

0    17372
1     5650
2     1499
Name: TARGET, dtype: int64

Add binary indicator features for missing data: 1 if the value is missing, else 0.

In [4716]:
na_col = train.isna().sum()>0
na_columns = list(train.columns[na_col])
for i in na_columns:
    # 1 if the original value is missing, else 0
    train[i+'_na'] = train[i].isna().astype(int)
    test[i+'_na'] = test[i].isna().astype(int)

Clean data.

In [4717]:
ID = test.ID
cleaning_na(train)
cleaning_na(test)
X = train.drop(['TARGET'],axis=1)
y = train.TARGET

Scale values with RobustScaler, which centers on the median and scales by the interquartile range, so it is less sensitive to outliers than StandardScaler.  
Columns with few unique values (fewer than 25 here) are probably categorical, so we one-hot encode them.
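
The difference between the two scalers shows up clearly on a small made-up array with one extreme value. The sketch below uses plain NumPy and the default definitions of both scalers: RobustScaler subtracts the median and divides by the interquartile range, while StandardScaler subtracts the mean and divides by the standard deviation.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # made-up feature with one outlier

# StandardScaler-style: (x - mean) / std
standard = (x - x.mean()) / x.std()

# RobustScaler-style: (x - median) / IQR, with IQR = Q3 - Q1
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print(standard.round(3))  # the four typical values end up almost identical
print(robust.round(3))    # the four typical values keep a usable spread
```

With mean/std scaling the outlier inflates the standard deviation, so the ordinary values collapse into a narrow band; median/IQR scaling leaves their relative spacing intact.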

In [4718]:
feats_to_oh=[]
for i in range(X.shape[1]):
    if len(X.iloc[:,i].unique())<25:
        feats_to_oh.append(i)

In [4719]:
X.iloc[:,feats_to_oh] = X.iloc[:,feats_to_oh].astype('str')
test.iloc[:,feats_to_oh] = test.iloc[:,feats_to_oh].astype('str')

In [4720]:
numerical_cols = np.where((X.dtypes=='float')|(X.dtypes=='int'))[0]
categorical_cols= np.where(X.dtypes=='object')[0]
robust = RobustScaler()
cat = OneHotEncoder(handle_unknown='ignore')

In [4721]:
X.iloc[:,numerical_cols] = robust.fit_transform(X.iloc[:,numerical_cols])
test.iloc[:,numerical_cols] = robust.transform(test.iloc[:,numerical_cols])
X_cat = cat.fit_transform(X.iloc[:,categorical_cols])
test_cat = cat.transform(test.iloc[:,categorical_cols])

In [4722]:
# drop() with inplace=True returns None, so do not assign its result
X.drop(X.columns[categorical_cols],axis=1,inplace=True)
test.drop(test.columns[categorical_cols],axis=1,inplace=True)
X_all = np.hstack((X,X_cat.toarray()))
test_all = np.hstack((test,test_cat.toarray()))

In [188]:
robust = RobustScaler().fit(X)
X_robust = robust.transform(X)
test_robust = robust.transform(test)

Make a holdout set. The split is stratified so that the holdout keeps the original class distribution.

In [97]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=4,stratify=y)
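
The effect of `stratify` can be checked on synthetic labels. The sketch below builds a label vector with the class counts printed by `value_counts()` above and a placeholder feature column (both `X_demo` and `y_demo` are stand-ins, not the contest data), then compares class shares in the two parts of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as TARGET; the feature is a placeholder
y_demo = np.repeat([0, 1, 2], [17372, 5650, 1499])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4, stratify=y_demo)

print(np.bincount(y_tr) / len(y_tr))  # class shares in the training part
print(np.bincount(y_te) / len(y_te))  # class shares in the holdout part
```

Without `stratify`, the share of the smallest class in the holdout could drift by chance, which makes macro-F1 estimates noisier.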

Several ML algorithms to compare as baselines.

In [None]:
classifiers_opt = [DecisionTreeClassifier(max_depth=8,min_samples_leaf=2,random_state=4),
                   RidgeClassifier(alpha = 1.5, max_iter = 2000, random_state=4),
                   KNeighborsClassifier(algorithm = 'ball_tree',n_neighbors=3,weights='distance'),
                   RandomForestClassifier(),
                   SGDClassifier(penalty='l1',max_iter=1000,learning_rate='optimal',random_state=4)]

In [None]:
# DataFrame.append was removed in pandas 2.0; collect the rows with pd.concat
results = pd.concat([evaluate_score(classifier,X_train,y_train) for classifier in classifiers_opt],
                    ignore_index=True)
results.sort_values(by='cross_val_score')

Ridge classifier with class_weight. The class sizes differ a lot (17372 / 5650 / 1499); class_weight compensates for the imbalance by penalizing errors on the rarer classes more heavily.

In [3414]:
ridge = RidgeClassifier(random_state=4,alpha=100,max_iter=100,class_weight={0:1,1:2.3,2:3.1})
train = shuffle(train)
cvs = cross_val_score(ridge,train.drop(['TARGET'],axis=1),train.TARGET,cv=5,scoring='f1_macro')
cvs.mean()

0.5185800601109182
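
The weights `{0:1, 1:2.3, 2:3.1}` above were chosen by hand. For comparison, the sketch below computes what scikit-learn's `class_weight='balanced'` heuristic, `n_samples / (n_classes * class_count)`, would give for the class counts printed earlier. It is only a reference point, not the weighting used in this notebook.

```python
import numpy as np

counts = np.array([17372, 5650, 1499])  # TARGET value counts from above

# 'balanced' heuristic: n_samples / (n_classes * class_count)
balanced = counts.sum() / (len(counts) * counts)

# Rescale so that class 0 has weight 1, for comparison with the manual weights
relative = balanced / balanced[0]

for label, (b, r) in enumerate(zip(balanced, relative)):
    print(f'class {label}: balanced={b:.3f}, relative to class 0={r:.2f}')
```

The hand-tuned weights are much milder than full inverse-frequency weighting, which would weight class 2 roughly 11.6 times as heavily as class 0.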

In [8]:
train = pd.read_csv('../mf-accelerator/contest_train.csv')
test = pd.read_csv('../mf-accelerator/contest_test.csv')

In [None]:
na_col = train.isna().sum()>0
na_columns = list(train.columns[na_col])
for i in na_columns:
    # 1 if the original value is missing, else 0
    train[i+'_na'] = train[i].isna().astype(int)
    test[i+'_na'] = test[i].isna().astype(int)

In [9]:
ID = test.ID
cleaning_na(train)
cleaning_na(test)
X = train.drop(['TARGET'],axis=1)
y = train.TARGET

Manually correct the class imbalance by removing randomly chosen rows of the larger classes (0 and 1).

In [3987]:
target_0 = y_train[y_train==0].index
target_1 = y_train[y_train==1].index
target_to_drop_0 = np.random.choice(target_0,14000,replace=False)
target_to_drop_1 = np.random.choice(target_1,3000,replace=False)
to_drop  = np.union1d(target_to_drop_0,target_to_drop_1)
X_train = X_train.drop(to_drop,axis=0)
y_train = y_train.drop(to_drop,axis=0)
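
A reusable variant of the same idea, sketched under the assumption that one wants to cap every class at a fixed number of rows instead of hard-coding how many to drop. The function name `undersample` and the toy data are hypothetical; a seeded generator makes the random choice reproducible.

```python
import numpy as np
import pandas as pd

def undersample(X, y, max_per_class, seed=4):
    """Keep at most `max_per_class` randomly chosen rows of each class."""
    rng = np.random.default_rng(seed)
    keep = []
    for label, idx in y.groupby(y).groups.items():
        idx = np.asarray(idx)
        if len(idx) > max_per_class:
            idx = rng.choice(idx, size=max_per_class, replace=False)
        keep.extend(idx)
    keep = np.sort(np.asarray(keep))  # restore the original row order
    return X.loc[keep], y.loc[keep]

# Toy data: 10 rows of class 0, 4 of class 1, 2 of class 2
y_toy = pd.Series([0] * 10 + [1] * 4 + [2] * 2, name='TARGET')
X_toy = pd.DataFrame({'f': np.arange(len(y_toy))})
X_small, y_small = undersample(X_toy, y_toy, max_per_class=3)
print(y_small.value_counts().sort_index())
```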

In [5243]:
robust = RobustScaler().fit(X_train)
X_train = robust.transform(X_train)
X_test = robust.transform(X_test)
test = robust.transform(test)

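NearMiss undersampling of the majority class, followed by SMOTE oversampling of the minority classes. Note that `X_train_sel` comes from the feature-selection cell further down (In [17]); the cells were run out of order.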

In [44]:
near = NearMiss(sampling_strategy={0:7000})
X_res, y_res = near.fit_resample(X_train_sel, y_train)
X_res,y_res = shuffle(X_res,y_res)

In [45]:
smote = SMOTE(sampling_strategy='auto',random_state=4)
X_res, y_res = smote.fit_resample(X_res, y_res)
X_res,y_res = shuffle(X_res,y_res)

In [46]:
y_res.value_counts()

2    7000
1    7000
0    7000
Name: TARGET, dtype: int64
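
SMOTE balances the classes by creating synthetic minority samples: each new point lies on the segment between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates only that interpolation step on made-up 2-D points. It is a simplified illustration (the function name `smote_like` is hypothetical), not imblearn's implementation.

```python
import numpy as np

def smote_like(X_min, n_new, k=2, seed=4):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]   # k nearest neighbours per point
    base = rng.integers(0, len(X_min), size=n_new)         # random base points
    nb = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour for each
    gap = rng.random((n_new, 1))                           # position on the segment
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Made-up minority class: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_min, n_new=5)
print(synthetic)
```

Because the synthetic points are interpolations of training rows, resampling should be applied to the training part only, as in the cells above, so the holdout set keeps real data and the original class distribution.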

In [None]:
pca = PCA(n_components=10)

In [17]:
selector = feature_selection.SelectFromModel(LassoCV(random_state=4,max_iter=500)).fit(X_train,y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
test_sel = selector.transform(test)




In [18]:
X_test_sel.shape

(4905, 50)

In [4249]:
ridge = RidgeClassifier(random_state=4,alpha=1550,max_iter=100,class_weight={0:1,1:2.3,2:3.5})
X_train,y_train = shuffle(X_train,y_train)
cvs = cross_val_score(ridge,X_train,y_train,cv=3,scoring='f1_macro')
ridge.fit(X_train,y_train)
f1 = f1_score(y_test,ridge.predict(X_test),average="macro")
pred = ridge.predict(X_test)
print('cvs:',cvs.mean())
print(f'f1_score:{f1}')
print(confusion_matrix(y_test, pred))

cvs: 0.5125172241320333
f1_score:0.5270640087125986
[[2742  608  125]
 [ 619  431   80]
 [  78   93  129]]


In [4250]:
pred = ridge.predict(test_all)

In [4252]:
write_sumb('ridge_2.csv',pred)

In [4251]:
sum(pred)

4759

In [4253]:
y_test.value_counts()

0    3475
1    1130
2     300
Name: TARGET, dtype: int64

In [22]:
ridge = RidgeClassifier(random_state=4,alpha=10,max_iter=100,class_weight={0:1,1:1,2:1})
X_res,y_res = shuffle(X_res,y_res)
cvs = cross_val_score(ridge,X_res,y_res,cv=3,scoring='f1_macro')
ridge.fit(X_res,y_res)
f1 = f1_score(y_test,ridge.predict(X_test_sel),average="macro")
pred = ridge.predict(X_test_sel)
print('cvs:',cvs.mean())
print(f'f1_score:{f1}')
print(confusion_matrix(y_test, pred))

cvs: 0.5722280299141174
f1_score:0.43332991656047953
[[1833 1064  578]
 [ 396  458  276]
 [  34   45  221]]


Best hyperparameters found for BalancedRandomForestClassifier.

In [8]:
tree_params = {
    'sampling_strategy':['all'],
    'criterion':['entropy'],
    'max_depth':[None],
    'min_samples_split':[2],
    'min_samples_leaf':[2],
    'min_weight_fraction_leaf':[0],
    'max_features':["auto"],
    'max_leaf_nodes':[None],
    'min_impurity_decrease':[0],
    'bootstrap':[False],
    'replacement':[True],
    'ccp_alpha':[0]
}

In [49]:
%%time
#X_train,y_train = shuffle(X_train,y_train,random_state=4)
forest = BalancedRandomForestClassifier(n_estimators=100,random_state=4,bootstrap=False,replacement=True,
                                        n_jobs=4,
                                        criterion='entropy')#,class_weight={0:1,1:0.9,2:0.4})
forest.fit(X_train_sel,y_train)
pred = forest.predict(X_test_sel)

CPU times: user 6.1 s, sys: 164 ms, total: 6.26 s
Wall time: 2.94 s


In [50]:
f1_score(y_test,pred,average='macro')

0.44162617909972174

The confusion matrix shows where the model makes its errors.

In [51]:
confusion_matrix(y_test,pred)

array([[2078,  740,  657],
       [ 483,  382,  265],
       [  33,   45,  222]])
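
Per-class precision and recall can be read off this matrix directly (rows are true classes, columns are predicted classes). The sketch below recomputes them with NumPy from the numbers printed above; the mean of the resulting F1 values should agree with the `f1_score` output.

```python
import numpy as np

cm = np.array([[2078, 740, 657],
               [483, 382, 265],
               [33, 45, 222]])  # confusion matrix printed above

tp = np.diag(cm)
recall = tp / cm.sum(axis=1)      # share of each true class that was found
precision = tp / cm.sum(axis=0)   # share of each predicted class that was right
f1 = 2 * precision * recall / (precision + recall)

for label in range(3):
    print(f'class {label}: precision={precision[label]:.3f}, '
          f'recall={recall[label]:.3f}, f1={f1[label]:.3f}')
print('macro F1:', f1.mean())
```

Class 2 has reasonable recall (222 of 300 rows found) but low precision, because many class 0 and class 1 rows are predicted as class 2.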

In [5303]:
pred = forest.predict(test)

In [5304]:
sum(pred)

5108

In [5305]:
write_sumb('balanced_forest.csv',pred)

In [None]:
#robust data 0.2 split
#0:7000
#n_estimators=5000 0.5293229948836577 (sum 4994)
#n_estimators=100 0.5246443436331151

#n_estimators=100  0.5279228567440777   0:1,1:1.1,2:1.5
#n_estimators=100  0.5287763724236936   0:1,1:0.9,2:1.5
#n_estimators=100  0.5314890514797946   0:1,1:0.9,2:1.55
#n_estimators=5000  0.5309557718088711   0:1,1:0.9,2:1.55
#0.5332703560478363??????
#not robust
#n_estimators=100  0.5282454790166882  0:1,1:0.9,2:1.55 
#n_estimators=100  0.529150988742459  0:1,1:0.9,2:1.5 
# not robust selected 0.005
# 100 0.5303
# 500 0.5306
# 1500 0.5337932867932449
# 3500 0.5328634909230983
# 5000 0.5327493728153337
# 2000 0.532857614263758
# 1000 0.5340548236816324