# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

## Learning Objective

At the end of this experiment, you will be able to:

* Perform Data preprocessing

## Dataset

### Description

We will be using district wise demographics, enrollments, school and teacher indicator data to predict whether the literacy rate is high / medium / low in each district.

### Data Preprocessing

Data preprocessing is an important step of solving every machine learning problem. Most of
the datasets used with Machine Learning problems need to be processed / cleaned / transformed
so that a Machine Learning algorithm can be trained on it.

There are different steps involved for Data Preprocessing. These steps are as follows:

    1. Data Cleaning → In this step the primary focus is on
        -Handling missing data
        -Handling nosiy data
        -Detection and removal of outliers
    
    2. Data Integration → This process is used when data is gathered from various data sources
    and data are combined to form consistent data. This data after performing cleaning is used
    for analysis.
    
    3. Data Transformation → In this step we will convert the raw data into a specified for-
    mat according to the need of the model we are building. There are many options used for
    transforming the data as below:
        -Normalization
        -Aggregation
        -Generalization
        
    4. Data Reduction → After data transformation and scaling the redundancy within the data
    is removed and efficiently organizing the data is performed.



### Total Marks  = 20

### Setup Steps

In [0]:
#@title Please enter your registration id to start: (e.g. P181900101) { run: "auto", display-mode: "form" }
Id = "P181902118" #@param {type:"string"}


In [0]:
#@title Please enter your password (normally your phone number) to continue: { run: "auto", display-mode: "form" }
password = "8860303743" #@param {type:"string"}


In [36]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()
  
notebook="1_Mini_HCK_Data_Munging" #name of the notebook
Answer = "This notebook is evaluated by mentors during the lab"

def setup():
#  ipython.magic("sx pip3 install torch")  
    #  ipython.magic("sx pip3 install torch")
    ipython.magic("sx wget https://cdn.talentsprint.com/aiml/Experiment_related_data/data-20190108T113429Z-001.zip")
    ipython.magic("sx unzip data-20190108T113429Z-001.zip")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    
    ipython.magic("notebook -e "+ notebook + ".ipynb")
    
    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:        
        print(r["err"])
        return None        
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional, 
              "concepts" : Concepts, "record_id" : submission_id, 
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook}

      r = requests.post(url, data = data)
      r = json.loads(r.text)
      print("Your submission is successful.")
      print("Ref Id:", submission_id)
      print("Date of submission: ", r["date"])
      print("Time of submission: ", r["time"])
      print("View your submissions: https://iiith-aiml.talentsprint.com/notebook_submissions")
      print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
      return submission_id
    else: submission_id
    

def getAdditional():
  try:
    if Additional: return Additional      
    else: raise NameError('')
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None
  
def getConcepts():
  try:
    return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None

def getAnswer():
  try:
    return Answer
  except NameError:
    print ("Please answer Question")
    return None

def getId():
  try: 
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup 
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
    from IPython.display import HTML
    HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id))
  
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


#### Exercise 1 - (2 Marks)
We have four different files

* Districtwise_Basicdata.csv
* Districtwise_Enrollment_details_indicator.csv
* Districtwise_SchoolData.csv
* Districtwise_Teacher_indicator.csv
These files contain the neccesary data to solve the problem.
Load all the files correctly, after observing the header level details, data records etc

Hint : Use read_csv from pandas

In [0]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from collections import Counter
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd

basic_data = pd.read_csv('data/Districtwise_Basicdata.csv', skiprows=[0])
enroll_data = pd.read_csv('data/Districtwise_Enrollment_details_indicator.csv', skiprows=[0, 1, 2])
school_data = pd.read_csv('data/Districtwise_SchoolData.csv', skiprows=3)
teacher_data = pd.read_csv('data/Districtwise_Teacher_indicator.csv', skiprows=3)

basic_data = basic_data.rename(index=str, columns={'Year': 'year', 'Statecd': 'statecd'})
enroll_data = enroll_data.rename(index=str, columns={'Year': 'year', 'Statecd': 'statecd'})
school_data = school_data.rename(index=str, columns={'ac_year': 'year', 'Statecd': 'statecd'})
teacher_data = teacher_data.rename(index=str, columns={'ac_year': 'year', 'Statecd': 'statecd'})



#### Exercise 2  - (4 Marks)

* Remove the unwanted columns, which are unlikely to contribute for the prediction of overall literacy grade. The decision of what constitutes unwanted columns depends on how it effects your final accuracy (and very little on your domain understanding of education sector in India; you're encouraged however to exercise some domain understanding too if you wish)

**Hint** use pandas drop function to drop your choice of unwanted columns (if any).


* As the required data is present in different files, we need to integrate all the four to make single dataframe/dataset. For that purpose, create a unique identifier for each row in all the dataframes so that it can be used to map the data in different files correctly
* Join/integrate this data 

Example : data of the district ananthapur in Andrapradesh, which present in different files should form a single row 

Hint : 
* Use the combination of year, statecode, district code as unique identifier 

* Refer the following link for merge, join and concat syntaxes:  

https://pandas.pydata.org/pandas-docs/stable/merging.html


In [66]:
# Your Code Here

basic_plus_enroll = basic_data.merge(enroll_data, on=['year', 'statecd', 'distcd'])
basic_plus_enroll_plus_school = basic_plus_enroll.merge(school_data, on=['year', 'statecd', 'distcd'])
merged_data = basic_plus_enroll_plus_school.merge(teacher_data, on=['year', 'statecd', 'distcd'])

cols_to_drop = ['Gtoilet Sch1', 'Gtoilet Sch2', 'Gtoilet Sch3', 'Gtoilet Sch4', 'Gtoilet Sch5', 'Gtoilet Sch6', 'Gtoilet Sch7', 'Gtoilet Sch']
cols_to_drop += ['Btoilet Sch1', 'Btoilet Sch2', 'Btoilet Sch3', 'Btoilet Sch4', 'Btoilet Sch5', 'Btoilet Sch6', 'Btoilet Sch7']
cols_to_drop += ['Uniform P B', 'Uniform P G', 'Uniform Up B', 'Uniform Up G']
cols_to_drop += ['No Fem Sch1', 'No Fem Sch2', 'No Fem Sch3', 'No Fem Sch4', 'No Fem Sch5', 'No Fem Sch6', 'No Fem Sch7', 'No Fem Sch']
cols_to_drop += ['Residential P B', 'Residential P G', 'Residential Up B', 'Residential Up G']
cols_to_drop += ['Transport P B', 'Transport P G', 'Transport Up B', 'Transport Up G']
cols_to_drop += ['Station P B', 'Station P G', 'Station Up B', 'Station Up G']
cols_to_drop += ['Computer Sch1', 'Computer Sch2', 'Computer Sch3', 'Computer Sch4', 'Computer Sch5', 'Computer Sch6', 'Computer Sch7']
cols_to_drop += ['Kitshed1', 'Kitshed2', 'Kitshed3', 'Kitshed4', 'Kitshed5', 'Kitshed6', 'Kitshed7']
cols_to_drop += ['blocks', 'clusters', 'villages', 'p_06_pop', 'sexratio_06', 'growthrate', 'p_sc_pop', 'p_st_pop', 'State Name _x', 'distname_y', 'Enr Govt1', 'Enr Govt2', 'Enr Govt3', 'Enr Govt4', 'Enr Govt5', 'Enr Govt6', 'Enr Govt7', 'Enr Govt9', 'Enr Pvt1', 'Enr Pvt2', 'Enr Pvt3', 'Enr Pvt4', 'Enr Pvt5', 'Enr Pvt6', 'Enr Pvt7', 'Enr Pvt9', 'Enr R Govt1', 'Enr R Govt2', 'Enr R Govt3', 'Enr R Govt4', 'Enr R Govt5', 'Enr R Govt6', 'Enr R Govt7', 'Enr R Govt9', 'Enr R Pvt1', 'Enr R Pvt2', 'Enr R Pvt3', 'Enr R Pvt4', 'Enr R Pvt5', 'Enr R Pvt6', 'Enr R Pvt7', 'Enr R Pvt9', 'Enr Py4 C1', 'Enr Py4 C2', 'Enr Py4 C3', 'Enr Py4 C4', 'Enr Py4 C5', 'Enr Py4 C6', 'Enr Py4 C7', 'Enr Py4 C8', 'Enr Py3 C1', 'Enr Py3 C2', 'Enr Py3 C3', 'Enr Py3 C4', 'Enr Py3 C5', 'Enr Py3 C6', 'Enr Py3 C7', 'Enr Py3 C8', 'Enr Py2 C1', 'Enr Py2 C2', 'Enr Py2 C3', 'Enr Py2 C4', 'Enr Py2 C5', 'Enr Py2 C6', 'Enr Py2 C7', 'Enr Py2 C8', 'Enr Py1 C1', 'Enr Py1 C2', 'Enr Py1 C3', 'Enr Py1 C4', 'Enr Py1 C5', 'Enr Py1 C6', 'Enr Py1 C7', 'Enr Py1 C8', 'Enr Cy C1', 'Enr Cy C2', 'Enr Cy C3', 'Enr Cy C4', 'Enr Cy C5', 'Enr Cy C6', 'Enr Cy C7', 'Enr Cy C8', 'Sc Enrp Cy', 'Sc Enrup Cy', 'Scg Enrp Cy', 'Scg Enrup Cy', 'St Enrp Cy', 'Stg Enrp Cy', 'St Enrup Cy', 'Stg Enrup Cy', 'Gerp Py2', 'Gerp Py1', 'Gerp Cy', 'Gerup Py2', 'Gerup Py1', 'Gerup Cy', 'Nerp Py2', 'Nerp Py1', 'Nerp Cy', 'Nerup Py2', 'Nerup Py1', 'Nerup Cy', 'Pc Girls1', 'Pc Girls2', 'Pc Girls3', 'Pc Girls4', 'Pc Girls5', 'Pc Girls', 'Enr G C1', 'Enr G C2', 'Enr G C3', 'Enr G C4', 'Enr G C5', 'Enr G C6', 'Enr G C7', 'Enr G C8', 'Enr Dis B C1', 'Enr Dis B C2', 'Enr Dis B C3', 'Enr Dis B C4', 'Enr Dis B C5', 'Enr Dis B C6', 'Enr Dis B C7', 'Enr Dis B C8', 'Enr Dis G C1', 'Enr Dis G C2', 'Enr Dis G C3', 'Enr Dis G C4', 'Enr Dis G C5', 'Enr Dis G C6', 'Enr Dis G C7', 'Enr Dis G C8', 'Grossness P', 'Grossness Up', 'Enr Med1 1', 'Enr Med1 2', 'Enr Med1 3', 'Enr Med1 4', 'Enr Med1 5', 'Enr Med1 6', 'Enr Med1 7', 'Enr Med2 1', 'Enr Med2 2', 'Enr Med2 3', 'Enr Med2 4', 'Enr Med2 5', 'Enr Med2 6', 'Enr Med2 7', 'Enr Med3 1', 'Enr Med3 2', 'Enr Med3 3', 'Enr Med3 4', 'Enr Med3 5', 'Enr Med3 6', 'Enr Med3 7', 'Rep C1', 'Rep C2', 'Rep C3', 'Rep C4', 'Rep C5', 'Rep C6', 'Rep C7', 'Rep C8', 'Muslim P', 'Muslim Up', 'Muslim G P', 'Muslim G Up', 'Obc P', 'Obc Up', 'Obc G P', 'Obc G Up', 'State Name _y', 'distname_x', 'schgovt1', 'schgovt2', 'schgovt3', 'schgovt4', 'schgovt5', 'schgovt6', 'schgovt7', 'schgovt9', 'schpvt1', 'schpvt2', 'schpvt3', 'schpvt4', 'schpvt5', 'schpvt6', 'schpvt7', 'schpvt9', 'Sch R Govt1', 'Sch R Govt2', 'Sch R Govt3', 'Sch R Govt4', 'Sch R Govt5', 'Sch R Govt6', 'Sch R Govt7', 'Sch R Govt9', 'Sch R Pvt1', 'Sch R Pvt2', 'Sch R Pvt3', 'Sch R Pvt4', 'Sch R Pvt5', 'Sch R Pvt6', 'Sch R Pvt7', 'Sch R Pvt9', 'Cls1 School1', 'Cls1 School2', 'Cls1 School3', 'Cls1 School4', 'Cls1 School5', 'Cls1 School6', 'Cls1 School7', 'Cls1 School', 'Tch1 School1', 'Tch1 School2', 'Tch1 School3', 'Tch1 School4', 'Tch1 School5', 'Tch1 School6', 'Tch1 School7', 'Tch1 School', 'Pp Sch1', 'Pp Sch2', 'Pp Sch3', 'Pp Sch6', 'Water Sch1', 'Water Sch2', 'Water Sch3', 'Water Sch4', 'Water Sch5', 'Water Sch6', 'Water Sch7', 'Water Sch', 'Enr Stch Sch1', 'Enr Stch Sch2', 'Enr Stch Sch3', 'Enr Stch Sch4', 'Enr Stch Sch5', 'Enr Stch Sch6', 'Enr Stch Sch7', 'Enr Stch Sch', 'Sch 50enr1', 'Sch 50enr2', 'Sch 50enr3', 'Sch 50enr4', 'Sch 50enr5', 'Sch 50enr6', 'Sch 50enr7', 'Sch 50enr', 'Sch Since 2003 1', 'Sch Since 2003 2', 'Sch Since 2003 3', 'Sch Since 2003 4', 'Sch Since 2003 5', 'Sch Since 2003 6', 'Sch Since 2003 7', 'Tot Cls1', 'Tot Cls2', 'Tot Cls3', 'Tot Cls4', 'Tot Cls5', 'Tot Cls6', 'Tot Cls7', 'Tot Cls', 'Cls Good1', 'Cls Good2', 'Cls Good3', 'Cls Good4', 'Cls Good5', 'Cls Good6', 'Cls Good7', 'Cls Good', 'Cls Major1', 'Cls Major2', 'Cls Major3', 'Cls Major4', 'Cls Major5', 'Cls Major6', 'Cls Major7', 'Cls Major', 'Cls Minor1', 'Cls Minor2', 'Cls Minor3', 'Cls Minor4', 'Cls Minor5', 'Cls Minor6', 'Cls Minor7', 'Cls Minor', 'Cls Other1', 'Cls Other2', 'Cls Other3', 'Cls Other4', 'Cls Other5', 'Cls Other6', 'Cls Other7', 'Cls Other', 'Sdg 1', 'Sdg 2', 'Sdg 3', 'Sdg 4', 'Sdg 5', 'Sdg 6', 'Sdg 7', 'Tlm 1', 'Tlm 2', 'Tlm 3', 'Tlm 4', 'Tlm 5', 'Tlm 6', 'Tlm 7', 'Book P B', 'Book P G', 'Book Up B', 'Book Up G', 'Attend P B', 'Attend P G', 'Attend Up B', 'Attend Up G', 'Sch Un1', 'Sch Un2', 'Sch Un3', 'Sch Un4', 'Sch Un5', 'Sch Un6', 'Sch Un7', 'Sch Un9',  'Electric Sch1', 'Electric Sch2', 'Electric Sch3', 'Electric Sch4', 'Electric Sch5', 'Electric Sch6', 'Electric Sch7', 'Mdm 1', 'Mdm 2', 'Mdm 3', 'Mdm 4', 'Mdm 5', 'Mdm 6', 'Mdm 7', 'Smc 1', 'Smc 2', 'Smc 3', 'Smc 4', 'Smc 5', 'Smc 6', 'Smc 7', 'App By Road 1', 'App By Road 2', 'App By Road 3', 'App By Road 4', 'App By Road 5', 'App By Road 6', 'App By Road 7', 'Scr 30 P', 'Scr 35 Up', 'Ptr 30 P', 'Ptr 35 Up', 'Avg Instn Days P', 'Avg Instn Days Up', 'statename_y', 'distname_y', 'tch_govt1', 'tch_govt2', 'tch_govt3', 'tch_govt4', 'tch_govt5', 'tch_govt6', 'tch_govt7', 'tch_govt9', 'tch_pvt1', 'tch_pvt2', 'tch_pvt3', 'tch_pvt4', 'tch_pvt5', 'tch_pvt6', 'tch_pvt7', 'tch_pvt9', 'tch_un1', 'tch_un2', 'tch_un3', 'tch_un4', 'tch_un5', 'tch_un6', 'tch_un7', 'tch_un9', 'tch_bs1', 'tch_bs2', 'tch_bs3', 'tch_bs4', 'tch_bs5', 'tch_bs6', 'tch_bs7', 'tch_bs_p', 'tch_s1', 'tch_s2', 'tch_s3', 'tch_s4', 'tch_s5', 'tch_s6', 'tch_s7', 'tch_s_p', 'tch_hs1', 'tch_hs2', 'tch_hs3', 'tch_hs4', 'tch_hs5', 'tch_hs6', 'tch_hs7', 'tch_hs_p', 'tch_grad1', 'tch_grad2', 'tch_grad3', 'tch_grad4', 'tch_grad5', 'tch_grad6', 'tch_grad7', 'tch_grad_p', 'tch_pgrad1', 'tch_pgrad2', 'tch_pgrad3', 'tch_pgrad4', 'tch_pgrad5', 'tch_pgrad6', 'tch_pgrad7', 'tch_pgrad_p', 'tch_mph1', 'tch_mph2', 'tch_mph3', 'tch_mph4', 'tch_mph5', 'tch_mph6', 'tch_mph7', 'tch_mph_p', 'tch_pd1', 'tch_pd2', 'tch_pd3', 'tch_pd4', 'tch_pd5', 'tch_pd6', 'tch_pd7', 'tch_pd_p', 'tch_eduqual_nr1', 'tch_eduqual_nr2', 'tch_eduqual_nr3', 'tch_eduqual_nr4', 'tch_eduqual_nr5', 'tch_eduqual_nr6', 'tch_eduqual_nr7', 'tch_eduqual_nr_p', 'tch_m1', 'tch_m2', 'tch_m3', 'tch_m4', 'tch_m5', 'tch_m6', 'tch_m7', 'tch_f1', 'tch_f2', 'tch_f3', 'tch_f4', 'tch_f5', 'tch_f6', 'tch_f7', 'tch_nr1', 'tch_nr2', 'tch_nr3', 'tch_nr4', 'tch_nr5', 'tch_nr6', 'tch_nr7', 'tch_m_p1', 'tch_m_p2', 'tch_m_p3', 'tch_m_p4', 'tch_m_p5', 'tch_m_p6', 'tch_m_p7', 'tch_f_p1', 'tch_f_p2', 'tch_f_p3', 'tch_f_p4', 'tch_f_p5', 'tch_f_p6', 'tch_f_p7', 'tch_nr_p1', 'tch_nr_p2', 'tch_nr_p3', 'tch_nr_p4', 'tch_nr_p5', 'tch_nr_p6', 'tch_nr_p7', 'tch_sc_m1', 'tch_sc_m2', 'tch_sc_m3', 'tch_sc_m4', 'tch_sc_m5', 'tch_sc_m6', 'tch_sc_m7', 'tch_sc_f1', 'tch_sc_f2', 'tch_sc_f3', 'tch_sc_f4', 'tch_sc_f5', 'tch_sc_f6', 'tch_sc_f7', 'tch_st_m1', 'tch_st_m2', 'tch_st_m3', 'tch_st_m4', 'tch_st_m5', 'tch_st_m6', 'tch_st_m7', 'tch_st_f1', 'tch_st_f2', 'tch_st_f3', 'tch_st_f4', 'tch_st_f5', 'tch_st_f6', 'tch_st_f7', 'trn_tch_m1', 'trn_tch_m2', 'trn_tch_m3', 'trn_tch_m4', 'trn_tch_m5', 'trn_tch_m6', 'trn_tch_m7', 'trn_tch_f1', 'trn_tch_f2', 'trn_tch_f3', 'trn_tch_f4', 'trn_tch_f5', 'trn_tch_f6', 'trn_tch_f7', 'prof_trn_tch_r', 'prof_trn_tch_p', 'days_nontch', 'tch_nontch']

data = merged_data.drop(columns=cols_to_drop)

print(data.head())

      year  statecd                                        statename_x  \
0  2012-13       35  ANDAMAN & NICOBAR ISLANDS                     ...   
1  2012-13       35  ANDAMAN & NICOBAR ISLANDS                     ...   
2  2012-13       35  ANDAMAN & NICOBAR ISLANDS                     ...   
3  2012-13       28  ANDHRA PRADESH                                ...   
4  2012-13       28  ANDHRA PRADESH                                ...   

   distcd  totschools  totpopulation  p_urb_pop  sexratio overall_lit  \
0    3501         212       237586.0      55.89     874.0        High   
1    3503         181       105539.0       2.60     925.0        High   
2    3502          58        36819.0       0.00     778.0        High   
3    2801        4983      2737738.0      27.68    1003.0         Low   
4    2822        5188      4083315.0      28.09     977.0         Low   

   female_lit  
0       84.52  
1       79.39  
2       70.70  
3       51.99  
4       54.31  


Follow this steps in order to clean the data:

#### Exercise 3 - (3 Marks)

* Overall_lit is our target variable, which we need to predict. Delete the row with missing overall_lit column
* Take a call to replace the missing values in any other column appropriately with mean/median/mode
* Convert categorical values to numerical values
Example : If a feature contains categorical values such as dog, cat, mouse etc then replace them with 1, 2, 3 etc or using one hot encoding (your judgement)

*Hint* :
* Use pandas fillna function to replace the missing values

In [67]:
# Your Code Here

def overall_lit(n):
  if not type(n)==type(""):
    return -1
  if n.lower()=='high':
    n=2
  elif n.lower()=='medium':
    n=1
  elif n.lower()=='low':
    n=0
  return n

data['overall_lit']=data['overall_lit'].apply(overall_lit)


data = data[data.overall_lit != -1 ]

data['overall_lit'].fillna((data['overall_lit'].mode()),inplace=True)
data.isna().sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  del sys.path[0]


year             0
statecd          0
statename_x      0
distcd           0
totschools       0
totpopulation    0
p_urb_pop        6
sexratio         0
overall_lit      0
female_lit       0
dtype: int64

#### Exercise 4 - (3 Marks)

Use the functions below to adjust the outliers

smooth_out function takes pandas dataframe as input and caculates mean, standard deviation of every column to check whether all the values in that lies within the range of mean +/- 2*standard_deviation of that column or not.
If any of the values are not present in that boundary, then that values is brought on to the boundary.

**Hint:** Should  the index column be normalized too? 

<img src="https://cdn.talentsprint.com/aiml/Experiment_related_data/normal_dist.png">

In [0]:
# Function to clip and clam the data
def clip_clamp(x, mean, sd):
    # Checking whether the value is less than a differenced value between mean and standard deviation.
    if x < mean - 2*sd :
        return mean - 2*sd
    #Checking whether the value is greater than a differenced value between mean and standard deviation.
    elif x > mean + 2*sd :
        return mean + 2*sd
    # If above two conditions are not statisfied we will return the original value
    else :
        return x

In [0]:
# Function to smooth the data
def smooth_out(Total_data):
    for i in Total_data.columns:
        # Calculating the mean value
        mean = np.mean(Total_data[i].values, axis=0)
        # Calculating the standard deviation value
        sd = np.std(Total_data[i].values, axis=0)
        # Calculating the corrected value using clip and clamp function
        corrected = np.array([clip_clamp(x, mean, sd) for x in Total_data[i].values])
        # Storing the data in form of series
        Total_data[i] = pd.Series(corrected, index=Total_data[i].index)
    return Total_data

In [0]:
# Your Code Here

#### Exercise 5 - (2 Marks)

Use the function below (corr_features) to identify uncorrelated features and remove the remaining features
* corr_features takes pandas dataframe, columns in the dataframe and bar (corelation co-efficient)

In [68]:
# Function to find uncorrelated features
def corr_features(df,cols,bar=0.9):
    for c,i in enumerate(cols[:-1]):
        col_set = set(cols)
        for j in cols[c+1:]:
            if i==j:
                continue
           
            score = df[i].corr(df[j])
            
            if score>bar:
                cols = list(col_set-set([j]))
            if score<-bar:
                cols = list(col_set-set([j]))
    return cols



numeric_cols = []
for col in data.columns:
    try:
        if type(data[col][0]) != str:
            numeric_cols.append(col)
    except:
        pass
# print(numeric_cols)
filtered_cols = corr_features(data, numeric_cols)
print(filtered_cols)


['statecd', 'sexratio', 'totschools', 'p_urb_pop', 'totpopulation', 'female_lit', 'overall_lit']


#### Exercise 6 - (3 Marks)

Perform Mean Correction and Standard Scaling on the data feature/column wise.

**Hint:** In order to understand the idea behind the terms used above, you may refer the following link: 

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [69]:

tmp_df = df.select_dtypes(exclude=['object'])

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(copy = False)
print(scaler.fit(tmp_df))
print(scaler.mean_)

StandardScaler(copy=False, with_mean=True, with_std=True)
[1.71087613e+01 1.72739124e+03 1.09166994e+03 2.67594411e+02
 5.77643505e+00 1.81322727e+02 4.37802115e+01 1.58328290e+01
 4.11470811e+01 6.50529501e-01 1.69873867e+02 1.29706949e+02
 4.14863636e+01 3.99629349e+01 2.96301059e+01 3.34478852e+01
 3.04198185e+01 4.47987851e-02 1.02338066e+03 2.42568405e+02
 3.23222390e+00 1.71014372e+02 3.53850227e+01 1.42662632e+01
 3.59750755e+01 6.98027314e-02 1.15727341e+02 7.36465257e+01
 1.93547655e+01 3.31014383e+01 1.90605144e+01 1.82900302e+01
 2.03413897e+01 2.95230886e-02 9.69478852e+01 2.53247734e+00
 3.90778534e-01 4.00453172e+00 1.31873112e+00 4.09365559e-01
 5.11858006e+00 1.10756798e+02 1.50400302e+02 4.73187311e+00
 3.75377644e-01 2.73776435e+01 7.36404834e-01 3.04380665e-01
 6.25377644e-01 1.84563444e+02 0.00000000e+00 0.00000000e+00
 0.00000000e+00 0.00000000e+00 8.76871526e+01 9.33460045e+01
 9.01632704e+01 8.11135423e+01 8.81542447e+01 8.67958837e+01
 7.79562387e+01 9.00857175e

  return self.partial_fit(X, y)


#### Exercise 7 - (3 Marks)

Apply different classifiers on the preprocessed data and figure out which classifier gives the best result.

In [80]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import linear_model


#KNN Function
def callKnn(data,targets):
  k = math.sqrt(data.shape[0])
  X_train, X_test, y_train, y_test = train_test_split(data, targets, test_size=0.2, random_state=42)
  neigh = KNeighborsClassifier(n_neighbors=3)
  neigh.fit(X_train, y_train)
  predicted_labels = neigh.predict(X_test)
  return accuracy_score(y_test,predicted_labels)


# Decision Tree
def callDecisionTree(data, targets):
  X_train, X_test, y_train, y_test = train_test_split(data, targets, test_size=0.2, random_state=42)
  decision_tree = DecisionTreeClassifier(max_depth=7)
  decision_tree.fit(X_train,y_train)
  decision_tree.predict(X_test)
  return decision_tree.score(X_test,y_test)

#Linear Class func
def callLinearClassifier(data, targets):
  X_train, X_test, y_train, y_test = train_test_split(data, targets, test_size=0.2, random_state=42)
  linear_classifier = linear_model.SGDClassifier()
  linear_classifier.fit(X_train,y_train)
  linear_classifier.predict(X_test)
  return linear_classifier.score(X_test,y_test)


tmp_df = df.select_dtypes(exclude=['object'])
data 
print(callKnn(data[['totschools', 'totpopulation', 'sexratio', 'female_lit']], data['overall_lit']))
print(callDecisionTree(data[['totschools', 'totpopulation', 'sexratio', 'female_lit']], data['overall_lit']))
print(callLinearClassifier(data[['totschools', 'totpopulation', 'sexratio', 'female_lit']], data['overall_lit']))



0.468503937007874
0.9566929133858267
0.48031496062992124




### Replace any of the above given functions and get correct results to get excellence

In [0]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = " " #@param ["Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [0]:
#@title If it was very easy, what more you would have liked to have been added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = " " #@param {type:"string"}

In [0]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = " " #@param ["Yes", "No"]

In [0]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id =return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")