![dphi banner](https://dphi-courses.s3.ap-south-1.amazonaws.com/Datathons/dphi_banner.png)

# Loading Libraries
Not all Python capabilities are loaded into the working environment by default, even those that are already installed on your system. So we import every library that we want to use.

In data science, NumPy and pandas are the most commonly used libraries. NumPy is used for numerical calculations such as the mean, median, and square roots. Pandas is used for data processing and data frames. For convenience, we give the libraries short aliases (numpy --> np and pandas --> pd).
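As a quick illustration of the two libraries, here is a minimal sketch on made-up numbers. The values and the names `values` and `toy_df` are assumptions for demonstration only and are not part of the challenge data.

```python
import numpy as np
import pandas as pd

# Toy values, made up for illustration only
values = np.array([1.0, 4.0, 9.0, 16.0])

mean_value = np.mean(values)      # arithmetic mean
median_value = np.median(values)  # middle value of the sorted data
roots = np.sqrt(values)           # element-wise square root

# A small DataFrame: a table with named columns
toy_df = pd.DataFrame({"value": values, "root": roots})

print(mean_value, median_value)
print(toy_df)
```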

Note: You can import all the libraries you think you will need up front, or import them as you go along.

Here we import the following libraries:

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.filterwarnings('ignore')

# Loading Dataset
The pandas module is used to read data files.

You can learn more about pandas [here](https://dphi.tech/learn/introduction-to-pandas)

In [4]:
df = pd.read_csv(r"train_dataset.csv")

## What do you need to do now?
*  Perform EDA and data visualization to understand the data. Learn more about EDA [here](https://dphi.tech/learn/introduction-to-exploratory-data-analysis). Learn more about data visualization [here](https://dphi.tech/learn/introduction-to-data-visualization).
*  Clean the data if required (for example, removing or filling missing values, or treating outliers). Learn more about handling missing values [here](https://youtu.be/EaGbS7eWSs0).
*  Perform data preprocessing if you feel it's required. Learn about one-hot encoding [here](https://youtu.be/9yl6-HEY7_s).
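The cleaning and preprocessing steps listed above can be sketched on a tiny made-up table. The columns `weight` and `city` are hypothetical and do not come from `train_dataset.csv`; filling with the median is just one common choice, not the only valid one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; these columns are NOT from train_dataset.csv
toy = pd.DataFrame({
    "weight": [60.0, np.nan, 80.0, 70.0],
    "city": ["A", "B", "A", "C"],
})

# Fill the missing numeric value with the column median
toy["weight"] = toy["weight"].fillna(toy["weight"].median())

# One-hot encode the categorical column: one indicator column per category
toy_encoded = pd.get_dummies(toy, columns=["city"])

print(toy_encoded)
```

Whether any of this is needed depends on the data you actually have, so check for missing values and non-numeric columns first.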

# **Basic EDA**

In [5]:
df.head()

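| | age | height(cm) | weight(kg) | waist(cm) | eyesight(left) | eyesight(right) | hearing(left) | hearing(right) | systolic | relaxation | ... | HDL | LDL | hemoglobin | Urine protein | serum creatinine | AST | ALT | Gtp | dental caries | smoking |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35 | 170 | 85 | 97.0 | 0.9 | 0.9 | 1 | 1 | 118 | 78 | ... | 70 | 142 | 19.8 | 1 | 1.0 | 61 | 115 | 125 | 1 | 1 |
| 1 | 20 | 175 | 110 | 110.0 | 0.7 | 0.9 | 1 | 1 | 119 | 79 | ... | 71 | 114 | 15.9 | 1 | 1.1 | 19 | 25 | 30 | 1 | 0 |
| 2 | 45 | 155 | 65 | 86.0 | 0.9 | 0.9 | 1 | 1 | 110 | 80 | ... | 57 | 112 | 13.7 | 3 | 0.6 | 1090 | 1400 | 276 | 0 | 0 |
| 3 | 45 | 165 | 80 | 94.0 | 0.8 | 0.7 | 1 | 1 | 158 | 88 | ... | 46 | 91 | 16.9 | 1 | 0.9 | 32 | 36 | 36 | 0 | 0 |
| 4 | 20 | 165 | 60 | 81.0 | 1.5 | 0.1 | 1 | 1 | 109 | 64 | ... | 47 | 92 | 14.9 | 1 | 1.2 | 26 | 28 | 15 | 0 | 0 |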


In [6]:
df.shape

(38984, 23)

In [7]:
df.describe()

| | age | height(cm) | weight(kg) | waist(cm) | eyesight(left) | eyesight(right) | hearing(left) | hearing(right) | systolic | relaxation | ... | HDL | LDL | hemoglobin | Urine protein | serum creatinine | AST | ALT | Gtp | dental caries | smoking |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 38984.0 | 38984.0 | 38984.0 | 38984.0 | 38984.0 | 38984.0 | 38984.0 | 38984.0 | 38984.0 | 38984.0 | ... | 38984.0 | 38984.0 | 38984.0 | 38984.0 | 38984.0 | 38984.0 | 38984.0 | 38984.0 | 38984.0 | 38984.0 |
| mean | 44.127591 | 164.689488 | 65.938718 | 82.062115 | 1.014955 | 1.008768 | 1.025369 | 1.02619 | 121.475631 | 75.994408 | ... | 57.293146 | 115.081495 | 14.624264 | 1.086523 | 0.88603 | 26.198235 | 27.145188 | 39.905038 | 0.214421 | 0.367279 |
| std | 12.063564 | 9.187507 | 12.896581 | 9.326798 | 0.498527 | 0.493813 | 0.157246 | 0.159703 | 13.643521 | 9.658734 | ... | 14.617822 | 42.883163 | 1.566528 | 0.402107 | 0.220621 | 19.175595 | 31.309945 | 49.693843 | 0.410426 | 0.48207 |
| min | 20.0 | 130.0 | 30.0 | 51.0 | 0.1 | 0.1 | 1.0 | 1.0 | 71.0 | 40.0 | ... | 4.0 | 1.0 | 4.9 | 1.0 | 0.1 | 6.0 | 1.0 | 2.0 | 0.0 | 0.0 |
| 25% | 40.0 | 160.0 | 55.0 | 76.0 | 0.8 | 0.8 | 1.0 | 1.0 | 112.0 | 70.0 | ... | 47.0 | 91.0 | 13.6 | 1.0 | 0.8 | 19.0 | 15.0 | 17.0 | 0.0 | 0.0 |
| 50% | 40.0 | 165.0 | 65.0 | 82.0 | 1.0 | 1.0 | 1.0 | 1.0 | 120.0 | 76.0 | ... | 55.0 | 113.0 | 14.8 | 1.0 | 0.9 | 23.0 | 21.0 | 26.0 | 0.0 | 0.0 |
| 75% | 55.0 | 170.0 | 75.0 | 88.0 | 1.2 | 1.2 | 1.0 | 1.0 | 130.0 | 82.0 | ... | 66.0 | 136.0 | 15.8 | 1.0 | 1.0 | 29.0 | 31.0 | 44.0 | 0.0 | 1.0 |
| max | 85.0 | 190.0 | 135.0 | 129.0 | 9.9 | 9.9 | 2.0 | 2.0 | 233.0 | 146.0 | ... | 359.0 | 1860.0 | 21.1 | 6.0 | 11.6 | 1090.0 | 2914.0 | 999.0 | 1.0 | 1.0 |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38984 entries, 0 to 38983
Data columns (total 23 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  38984 non-null  int64  
 1   height(cm)           38984 non-null  int64  
 2   weight(kg)           38984 non-null  int64  
 3   waist(cm)            38984 non-null  float64
 4   eyesight(left)       38984 non-null  float64
 5   eyesight(right)      38984 non-null  float64
 6   hearing(left)        38984 non-null  int64  
 7   hearing(right)       38984 non-null  int64  
 8   systolic             38984 non-null  int64  
 9   relaxation           38984 non-null  int64  
 10  fasting blood sugar  38984 non-null  int64  
 11  Cholesterol          38984 non-null  int64  
 12  triglyceride         38984 non-null  int64  
 13  HDL                  38984 non-null  int64  
 14  LDL                  38984 non-null  int64  
 15  hemoglobin           38984 non-null 

# Separating Input Features and Output Features
Before building any machine learning model, we separate the input variables from the output variable. Input variables are the quantities whose values vary naturally in an experiment, whereas the output variable is the one whose value depends on the input variables. Input variables are therefore also known as independent variables, since their values do not depend on any other quantity, and output variables are also known as dependent variables, since their values depend on the input variables. In this data, we want to predict whether a person smokes, so the variable **smoking** is our target variable and the remaining features are the input variables.

By convention input variables are represented with 'X' and output variables are represented with 'y'.

In [9]:
df.isna().sum()

age                    0
height(cm)             0
weight(kg)             0
waist(cm)              0
eyesight(left)         0
eyesight(right)        0
hearing(left)          0
hearing(right)         0
systolic               0
relaxation             0
fasting blood sugar    0
Cholesterol            0
triglyceride           0
HDL                    0
LDL                    0
hemoglobin             0
Urine protein          0
serum creatinine       0
AST                    0
ALT                    0
Gtp                    0
dental caries          0
smoking                0
dtype: int64

In [10]:
X = df.drop('smoking', axis=1)   # drop the target column; the remaining columns are the input features.
                                 # df itself is unchanged because we have not used 'inplace=True'
y = df['smoking']                # output/dependent variable

# Splitting the data into Train and Test Set
We want to check the performance of the model that we build. For this purpose, we split the given data (both inputs and outputs) into a training set, which is used to train the model, and a test set, which is used to check how accurately the model predicts outcomes.

For this purpose we have a function called 'train_test_split' in the 'sklearn.model_selection' module.


In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [12]:
X_train.shape

(27288, 22)
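The split above is random, so the exact rows in each part change from run to run. A minimal sketch of a repeatable, class-balanced split is shown here on synthetic data. The seed value 42 and the names `X_demo` and `y_demo` are assumptions for illustration; `random_state` and `stratify` are standard `train_test_split` parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 30% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,      # same hold-out fraction as above
    random_state=42,    # assumed seed; any fixed integer makes the split repeatable
    stratify=y_demo,    # keep the class ratio the same in both parts
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())   # share of positive labels in each part
```

Fixing the seed makes scores comparable between runs, and stratifying can help when one class is much rarer than the other.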

In [13]:
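scaler = StandardScaler().fit(X_train)   # learn each column's mean and standard deviation from the training set only
X_train = scaler.transform(X_train)      # standardize the training features
X_test = scaler.transform(X_test)        # apply the same training statistics to the test features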

In [14]:
X_train

array([[ 0.48775941, -1.05679505, -1.23051155, ..., -0.26856198,
        -0.39825784,  1.90417142],
       [-2.00101745,  0.57544218, -0.84472818, ..., -0.44402813,
        -0.43822739, -0.5251628 ],
       [-0.34183288,  0.57544218,  0.69840533, ...,  1.36912211,
         3.29892526, -0.5251628 ],
       ...,
       [-0.34183288,  1.11952125,  0.31262195, ...,  0.31632519,
         1.58023474, -0.5251628 ],
       [ 0.07296326,  0.57544218,  0.69840533, ...,  0.78423493,
         0.46108742, -0.5251628 ],
       [-0.34183288, -1.05679505, -0.84472818, ..., -0.50251685,
        -0.59810557, -0.5251628 ]])
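To see what standardization does, here is a small sketch on made-up numbers (the arrays `train_part` and `test_part` are assumptions for illustration). The scaler learns the mean and standard deviation from the training rows only and then reuses them for any other data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers for illustration
train_part = np.array([[1.0], [2.0], [3.0]])
test_part = np.array([[4.0]])

demo_scaler = StandardScaler().fit(train_part)   # learns mean and std from train_part only
train_scaled = demo_scaler.transform(train_part)
test_scaled = demo_scaler.transform(test_part)   # reuses the training mean and std

print(demo_scaler.mean_)       # mean learned from the training rows
print(train_scaled.ravel())    # training column now has mean 0 and standard deviation 1
print(test_scaled.ravel())
```

Fitting the scaler on the training set alone keeps information about the test rows from leaking into training.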

# Building Model
Now we are finally ready to train the model.

There are many machine learning models, such as Logistic Regression, Random Forest, and Decision Tree, to name a few. Here we use Logistic Regression (from the sklearn library).

We then feed the model both the data (X_train) and the answers for that data (y_train).

### Train the model

In [15]:
from sklearn.linear_model import LogisticRegression

In [16]:
model = LogisticRegression()

In [17]:
model.fit(X_train,y_train)

LogisticRegression()

In [18]:
print(X_train.shape,y_train.shape)

(27288, 22) (27288,)


# Validate The Model
Wonder 🤔 how well your model learned? Let's check it.

# Predict on the testing data (X_test)
Now we use our trained model to predict on the test set we created (X_test) and evaluate the model on unseen data.

In [19]:
pred = model.predict(X_test)

## Model Evaluation
Evaluating the performance of the machine learning model we have built is an essential part of any machine learning project. Performance is measured using evaluation metrics.

There are many evaluation metrics for classification problems, such as Accuracy Score, F1 Score, Precision, and Recall. However, **Accuracy Score** is the metric for this challenge.

In [20]:
from sklearn.metrics import accuracy_score

# Check the accuracy on the validation (hold-out) set.
# accuracy_score expects (y_true, y_pred), so pass the labels and the predictions, not X_test.

In [21]:
print('Accuracy score', accuracy_score(y_test, pred))

Accuracy score 0.72734268125855
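Other common classification metrics can be computed in the same way. This is a minimal sketch on made-up labels (the lists `y_true_demo` and `y_pred_demo` are assumptions, not model output), using functions from `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels for illustration (1 = smoker, 0 = non-smoker)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 1]

acc = accuracy_score(y_true_demo, y_pred_demo)    # share of correct predictions
prec = precision_score(y_true_demo, y_pred_demo)  # of predicted smokers, how many really are
rec = recall_score(y_true_demo, y_pred_demo)      # of real smokers, how many were found
f1 = f1_score(y_true_demo, y_pred_demo)           # harmonic mean of precision and recall

print(acc, prec, rec, f1)
```

Here 5 of 8 predictions are correct, so accuracy is 0.625, while precision is 3/5 and recall is 3/4. The metrics can disagree, which is why it helps to look at more than one.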


# Predict The Output For Testing Dataset 😅
We have trained our model and evaluated it, and now we will predict the output/target for the testing data (i.e. test_dataset.csv) given in the 'Data' section of the problem page.

## Load Test Set
Load the test data on which final submission is to be made.

In [22]:
test_data = pd.read_csv(r'test_dataset.csv')

**Note:** 
*  Use the same techniques to deal with missing values as done with the training dataset.   

*  **Don't remove any observation/record from the test dataset, otherwise you will get a wrong answer. The number of items in your prediction should be the same as the number of records present in the test dataset**.

*  Use the same techniques to preprocess the data as done with training dataset.

***Why do we need to do the same procedure of filling missing values, data cleaning and data preprocessing on the new test data as it was done for the training and validation data?***

**Ans:** Because our model has been trained on data in a certain format. If we don't provide the testing data in the same format, the model will give erroneous predictions and its accuracy will drop. Also, if the model was built on 'n' features, you should always give it the same number of features when predicting on new test data. If you provide a different number of features, your ML model will throw a ValueError saying something like 'number of features given x; expecting n'. Not confident about these statements? Well, as a data scientist you should always run an experiment and observe the results.
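Here is one such experiment, as a minimal sketch on synthetic data (the names `X_three`, `X_two`, and `toy_model` are assumptions for illustration). The model is trained on 3 features and then asked to predict from only 2. The exact wording of the error message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 rows with 3 features, labels alternate 0/1
rng = np.random.default_rng(0)
X_three = rng.normal(size=(20, 3))
y_toy = np.array([0, 1] * 10)

toy_model = LogisticRegression().fit(X_three, y_toy)

# Predicting on rows with only 2 features should fail
X_two = X_three[:, :2]
try:
    toy_model.predict(X_two)
    error_raised = False
except ValueError as err:
    error_raised = True
    print("ValueError:", err)
```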



In [23]:
test_data.head()

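| | age | height(cm) | weight(kg) | waist(cm) | eyesight(left) | eyesight(right) | hearing(left) | hearing(right) | systolic | relaxation | ... | triglyceride | HDL | LDL | hemoglobin | Urine protein | serum creatinine | AST | ALT | Gtp | dental caries |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 170 | 65 | 75.1 | 1.0 | 0.9 | 1 | 1 | 120 | 70 | ... | 260 | 41 | 132 | 15.7 | 1 | 0.8 | 24 | 26 | 32 | 0 |
| 1 | 45 | 170 | 75 | 89.0 | 0.7 | 1.2 | 1 | 1 | 100 | 67 | ... | 345 | 49 | 140 | 15.7 | 1 | 1.1 | 26 | 28 | 138 | 0 |
| 2 | 30 | 180 | 90 | 94.0 | 1.0 | 0.8 | 1 | 1 | 115 | 72 | ... | 103 | 53 | 103 | 13.5 | 1 | 1.0 | 19 | 29 | 30 | 0 |
| 3 | 60 | 170 | 50 | 73.0 | 0.5 | 0.7 | 1 | 1 | 118 | 78 | ... | 70 | 65 | 108 | 14.1 | 1 | 1.3 | 31 | 28 | 33 | 0 |
| 4 | 30 | 170 | 65 | 78.0 | 1.5 | 1.0 | 1 | 1 | 110 | 70 | ... | 210 | 45 | 103 | 14.7 | 1 | 0.8 | 21 | 21 | 19 | 0 |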


In [24]:
test_data.isna().sum()

age                    0
height(cm)             0
weight(kg)             0
waist(cm)              0
eyesight(left)         0
eyesight(right)        0
hearing(left)          0
hearing(right)         0
systolic               0
relaxation             0
fasting blood sugar    0
Cholesterol            0
triglyceride           0
HDL                    0
LDL                    0
hemoglobin             0
Urine protein          0
serum creatinine       0
AST                    0
ALT                    0
Gtp                    0
dental caries          0
dtype: int64

## Make Prediction on Test Dataset
Time to make submission!!!

In [25]:
target = model.predict(scaler.transform(test_data))   # scale the test features with the scaler fitted on X_train, then predict
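One way to make sure new data always receives the same preprocessing as the training data is to bundle the scaler and the model into a single scikit-learn pipeline. This is a sketch on synthetic data and not the notebook's own approach; the names `X_fit`, `y_fit`, `X_new`, and `pipe` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training and test features
rng = np.random.default_rng(0)
X_fit = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
y_fit = (X_fit[:, 0] > 50.0).astype(int)   # label depends on the first feature
X_new = rng.normal(loc=50.0, scale=10.0, size=(10, 4))

# The pipeline scales and then classifies, in one object
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_fit, y_fit)

# X_new is scaled with the training statistics automatically before prediction
new_pred = pipe.predict(X_new)
print(new_pred.shape)
```

With a pipeline there is no separate `transform` step to forget when predicting on the final test set.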

#### Note: **Follow the submission guidelines given in 'How To Submit' Section.**

## How to save prediction results locally via jupyter notebook?
If you are working in a Jupyter notebook, execute the block of code below. A file named 'submission.csv' will be created in your current working directory.

In [26]:
#target = pd.read_csv(r'test_ans.csv')
res = pd.DataFrame(target)   # target holds the final predictions of your model on the input features of the new, unseen test data
res.columns = ["smoking"]
res.to_csv("submission.csv", index = False)      # the csv file is saved in the same location as this notebook
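Before submitting, it can be worth checking that the file has exactly one prediction per test record. The sketch below uses hypothetical stand-ins (`toy_test`, `toy_target`) and writes to an in-memory buffer instead of a real file, so the checks are assumptions about what a valid submission looks like rather than official rules.

```python
import io

import pandas as pd

# Hypothetical stand-ins: 5 test rows and 5 predictions
toy_test = pd.DataFrame({"age": [40, 45, 30, 60, 30]})
toy_target = [0, 1, 0, 0, 1]

toy_res = pd.DataFrame({"smoking": toy_target})

# One prediction per test row, a single column, no missing values
same_length = len(toy_res) == len(toy_test)
single_column = list(toy_res.columns) == ["smoking"]
no_missing = bool(toy_res["smoking"].notna().all())

# Write to an in-memory buffer to inspect the CSV text without creating a file
buffer = io.StringIO()
toy_res.to_csv(buffer, index=False)
header_line = buffer.getvalue().splitlines()[0]

print(same_length, single_column, no_missing, header_line)
```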

# **OR**, 
**if you are working on Google Colab then use the below set of code to save prediction results locally**

## How to save prediction results locally via colab notebook?
If you are working in a Google Colab notebook, execute the block of code below. A file named 'submission.csv' will be downloaded to your system.

In [27]:
# Create a DataFrame of the predicted values, indexed in the same order as the test data
#target = pd.read_csv(r'/content/test_ans.csv')
res = pd.DataFrame(target)   # target holds the final predictions of your model on the input features of the new, unseen test data
res.columns = ["smoking"]

# To download the csv file locally
from google.colab import files
res.to_csv('submission.csv', index = False)         
files.download('submission.csv')


## **Well Done! 👍**
You are all set to make a submission. Let's head to the challenge page to make the submission.