# Baseline for EMSPM Educational Challenge on AIcrowd
#### Author : Ayush Shivani

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ayushshivani/aicrowd_educational_baselines/blob/master/EMSPM_baseline.ipynb)


## Download Necessary Packages

In [1]:
import sys
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install scikit-learn



## Download data

In [None]:
!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_emspm/data/public/test.csv
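!wget https://s3.eu-central-1.wasabisys.com/aicrowd-public-datasets/aicrowd_educational_emspm/data/public/train.zip
# Extract the archive; it is assumed to contain train.csv, which is loaded below
!unzip -o train.zip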


## Import packages

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import f1_score,precision_score,recall_score,accuracy_score

## Load the data

In [3]:
train_data_path = "train.csv" #path where data is stored

In [4]:
train_data = pd.read_csv(train_data_path) #load data in dataframe using pandas

In [6]:
train_data.head() # Visualize the data

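| | data | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | ... | Unnamed: 48 | Unnamed: 49 | Unnamed: 50 | Unnamed: 51 | Unnamed: 52 | Unnamed: 53 | Unnamed: 54 | Unnamed: 55 | Unnamed: 56 | Unnamed: 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.09 | 0.0 | 0.09 | 0.0 | 0.39 | 0.09 | 0.09 | 0.0 | 0.19 | 0.29 | ... | 0.0 | 0.139 | 0.0 | 0.31 | 0.155 | 0.0 | 6.813 | 494 | 1458 | 1 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.124 | 0.124 | 0.0 | 0.0 | 0.0 | 0.0 | 1.8 | 8 | 45 | 0 |
| 2 | 0.0 | 0.0 | 2.43 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.27 | 0.0 | ... | 0.0 | 0.344 | 0.0 | 0.0 | 0.0 | 0.0 | 2.319 | 12 | 167 | 0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.31 | 0.0 | 1.31 | 1.31 | 1.31 | 1.31 | ... | 0.0 | 0.0 | 0.0 | 0.117 | 0.117 | 0.0 | 48.5 | 186 | 291 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.307 | 8 | 30 | 0 |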


Select the columns you need to train on. Columns with a text data type can also be used, but their values must first be mapped to numbers before they are selected.
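
As a minimal sketch of mapping a text column to numbers, the snippet below works on a small made-up frame. The column names `word_freq` and `subject_type` are hypothetical and are not part of the EMSPM data; `pd.factorize` is only one of several ways to encode text values.

```python
import pandas as pd

# Toy frame; both column names are hypothetical and used for illustration only.
toy = pd.DataFrame({
    "word_freq": [0.09, 0.0, 2.43, 0.0],
    "subject_type": ["promo", "personal", "promo", "work"],
})

# pd.factorize assigns integer codes in order of first appearance.
codes, categories = pd.factorize(toy["subject_type"])
toy["subject_type_code"] = codes

# Keep only the numeric columns, which are the ones a classifier can consume.
numeric_features = toy.select_dtypes(include="number")
print(numeric_features)
```

Integer codes imply an ordering between categories, so for linear models a one-hot encoding such as `pd.get_dummies` may be a better fit.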

## Split the data into train/validation sets

In [7]:
X_train, X_test = train_test_split(train_data, test_size=0.2, random_state=42)  # hold out 20% of the rows for validation

Check which column contains the variable that needs to be predicted. Here it is the last column.

In [8]:
X_train,y_train = X_train.iloc[:,:-1],X_train.iloc[:,-1]
X_test,y_test = X_test.iloc[:,:-1],X_test.iloc[:,-1]
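
One way to confirm that the last column really is the label is to look at its distinct values after the split. The sketch below uses a tiny made-up frame, so the values are assumptions for illustration and are not taken from the real data.

```python
import pandas as pd

# Tiny stand-in for the training frame; the last column plays the role of the label.
toy = pd.DataFrame(
    [[0.09, 494, 1], [0.0, 8, 0], [2.43, 12, 0], [0.0, 186, 1]],
    columns=["data", "Unnamed: 1", "Unnamed: 2"],
)

# Same slicing as above: every column except the last is a feature.
features, labels = toy.iloc[:, :-1], toy.iloc[:, -1]

# A label column should contain only a handful of distinct values.
label_counts = labels.value_counts().sort_index()
print(label_counts)
```

The same `value_counts()` call on `y_train` also shows how balanced the two classes are.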

## Define the classifier

In [22]:
classifier = LogisticRegression(solver = 'lbfgs',multi_class='auto', max_iter=10000)

One can set more parameters. To see the list of parameters visit [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

We can also use other classifiers. To read more about scikit-learn classifiers, visit [here](https://scikit-learn.org/stable/supervised_learning.html).
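
As an illustration of swapping in another classifier, here is a hedged sketch using `SVC`, which is imported above but not otherwise used. It runs on synthetic data from `make_classification` as a stand-in for the challenge data; the feature scaling step and the hyperparameters are assumptions, not tuned choices from this baseline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary data with 57 features, standing in for the real training set.
X, y = make_classification(
    n_samples=400, n_features=57, n_informative=10, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# SVMs are sensitive to feature scale, so standardize before fitting.
svm_classifier = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm_classifier.fit(X_tr, y_tr)

val_accuracy = svm_classifier.score(X_val, y_val)
print("Validation accuracy:", val_accuracy)
```

Because the pipeline exposes the same `fit` and `predict` methods, it can replace `classifier` in the cells that follow without other changes.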

## Train the classifier

In [23]:
classifier.fit(X_train, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=10000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Predict on test set

In [24]:
y_pred = classifier.predict(X_test)

## Find the scores 

In [25]:
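# Micro-averaged precision and recall equal accuracy in single-label classification
precision = precision_score(y_test,y_pred,average='micro')
recall = recall_score(y_test,y_pred,average='micro')
accuracy = accuracy_score(y_test,y_pred)
# Macro-averaged F1 is the unweighted mean of the per-class F1 scores
f1 = f1_score(y_test,y_pred,average='macro')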

In [26]:
print("Accuracy of the model is :" ,accuracy)
print("Recall of the model is :" ,recall)
print("Precision of the model is :" ,precision)
print("F1 score of the model is :" ,f1)

Accuracy of the model is : 0.9307065217391305
Recall of the model is : 0.9307065217391305
Precision of the model is : 0.9307065217391305
F1 score of the model is : 0.9269453316128429
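
Accuracy, recall and precision are identical above because micro-averaging pools all predictions before computing the metric, which for single-label classification reduces to accuracy. The sketch below shows this on a small hand-made example; the labels are made up for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made labels: four samples of class 0 and two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_hat)
micro_precision = precision_score(y_true, y_hat, average="micro")
micro_recall = recall_score(y_true, y_hat, average="micro")

# Macro averaging computes F1 per class, then takes the unweighted mean,
# so the minority class counts as much as the majority class.
macro_f1 = f1_score(y_true, y_hat, average="macro")

print("accuracy:", accuracy)
print("micro precision:", micro_precision)
print("micro recall:", micro_recall)
print("macro F1:", macro_f1)
```

Passing `average=None` instead returns one score per class, which is often more informative on imbalanced data.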


# Prediction on Evaluation Set

## Load the evaluation data

In [27]:
final_test_path = "test.csv"
final_test = pd.read_csv(final_test_path)


## Predict on evaluation set

In [28]:
submission = classifier.predict(final_test)

## Save the prediction to csv

In [29]:
submission = pd.DataFrame(submission)
submission.to_csv('submission.csv',header=['label'],index=False)
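
Before uploading, it can help to read the file back and check its shape. The sketch below writes stand-in predictions to a temporary directory; the single `label` column mirrors the header used above, and the prediction values are made up.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Stand-in predictions; in the notebook these come from classifier.predict.
predictions = np.array([1, 0, 0, 1])

# Write a single "label" column without the index, as in the cell above.
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
pd.DataFrame({"label": predictions}).to_csv(out_path, index=False)

# Read the file back and confirm the header and the number of rows.
check = pd.read_csv(out_path)
print(check.columns.tolist(), len(check))
```

The row count of the real file should match the number of rows in `test.csv`.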

### Go to the [platform](https://www.aicrowd.com/challenges/emspm-email-spam-prediction), participate in the challenge, and submit the generated submission.csv.