# Data Mining
## Assignment 1 - Data Cleansing

---

## Aadam - CS1945
## Obaidullah - CS1947

---

## Tasks to do
1. Combine all three datasets into a single source; you may use any DBMS.
2. Once the data is in the staging area inside your DBMS, perform data profiling for all the fields. Data profiling here means computing the following statistics:
    - No. of unique values (for each column) 
    - No. of nulls (for each column) 
    - Invalid values (for each column) 
    - Total no. of courses 
    - Total no. of female students Vs male students 
    - Total no. of people who have taken more than 5 courses in a semester.
    - The relationship between total no. of unique student IDs and total no. of students.
    - Average no. of people per semester for each campus.
    - Average no. of students in every batch of each campus.
3. After the profiling, you should have identified the anomalies and data cleansing issues. Here are some of the issues you need to address during your cleansing work.
    - Separate first and last names both for the student and father. Standardize the first and last names. (Hint: Find all the unique names in data and create a lookup table with two columns i.e. correct_name, variation etc. Use the lookup table to update the name fields with standardized names. Never hardcode names in your SQL.)
    - Introduce a new column i.e. city_name. It will involve extracting the city from the address and then the standardization of the city names. (Use the city names in the telephone directory as standard)
    - Bring the gender information into a consistent representation. Use ‘M’ and ‘F’ with data type CHAR(1).
    - Since the gender information is missing for the Peshawar campus, you can use the student names to figure out the gender. (Hint: Find all distinct male and female names and create a lookup table)
    - If you have been using VARCHAR or CHAR for date, port the date information into columns with the proper Date data type. The dates should be as per calendar dates.
    - Validate all the dates against the business rules, e.g. the DOB should be earlier than the Reg. Date, and the Graduation Date should be later than the Reg. Date. The data may contain anomalies such as a DOB and Reg. Date exchanged by mistake. Also be careful with invalid dates, e.g. 31st Feb. or 29th Feb. in a non-leap year.
    - For some campuses, the degree information is missing. Devise some technique to figure it out and update the rows with empty degree fields. Validate other business rules for each field. One example is that marks should be in the range 0 to 100 inclusive.


In [323]:
import sqlite3

import pandas as pd

from pathlib import Path

In [324]:
data_dir = Path("Assignment 1 Data Set/")

In [800]:
con = sqlite3.connect('University.db')

df_std = pd.read_sql_query("SELECT * FROM Students", con)
df_reg = pd.read_sql_query("SELECT * FROM Course_Registrations", con)

con.close()

In [801]:
df_std.head()

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus
0,KHR_BS_0,Hussain Ansary,Ubaid Ansary,1974-01-05 00:00:00,M,1994-08-13 00:00:00.000000,A,C,h# 978 Street No.72 Defense Phase 1 KHR,A-Level,BS,KHR
1,KHR_BS_1,Shk. Munir Hussaini,Viqar Hamid Hussaini,1974-12-13 00:00:00,M,1994-08-13 00:00:00.000000,A,C,H# 169 s# 0 Mutian wala Thata,A-Level,BS,KHR
2,KHR_BS_10,Hamna Ansary,Hameed Ansary,1974-04-25 00:00:00,F,1994-08-14 00:00:00.000000,A,C,h# 697 St. # 94 sea site KHR,A-Level,BS,KHR
3,KHR_BS_100,Jabbar Haqqie,Muneer Rai Haqqie,1974-05-22 00:00:00,M,1994-08-18 00:00:00.000000,A,C,H# 504 S No.4 Kumria Quetta,A-Level,BS,KHR
4,KHR_BS_1000,Rana Hayyat Baig,Ghulam Mustafa Baig,1976-02-25 00:00:00,M,1995-08-28 00:00:00.000000,A,C,H No.509 St. No.38 Kishwar heights university...,HSSC,BS,KHR


In [802]:
df_reg.head()

Unnamed: 0,SID,Course,Score,Semester,Year,Discipline,Degree,Campus
0,KHR_BS_0,CS-101,72,Fall,1994-01-01 00:00:00.000000,CS,BS,KHR
1,KHR_BS_0,CS-102,78,Fall,1994-01-01 00:00:00.000000,CS,BS,KHR
2,KHR_BS_0,CS-103,53,Fall,1994-01-01 00:00:00.000000,CS,BS,KHR
3,KHR_BS_0,CS-104,58,Fall,1994-01-01 00:00:00.000000,CS,BS,KHR
4,KHR_BS_0,CS-105,61,Spring,1995-01-01 00:00:00.000000,CS,BS,KHR


# Data Profiling

In [803]:
df_std.describe(include='all')

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus
count,17102,17102,17102,17100,13398,17102,17102,17102,17102,17102,17102,17102
unique,17102,14673,14270,7380,2,442,2,2,15360,12,2,3
top,PEW_BS_1573,Kishwar Pasha,Mudassir Khattak,1981-03-22 00:00:00.000000,M,2003-08-16 00:00:00.000000,A,C,house no.141 S# 16 Haider road Multan,F.Sc.,BS,KHR
freq,1,6,5,8,8935,76,17100,12500,3,2390,14302,8201


In [804]:
df_reg.describe(include='all')

Unnamed: 0,SID,Course,Score,Semester,Year,Discipline,Degree,Campus
count,406200,406200,406200.0,406200,406200,406200,406200,406200
unique,11700,120,,2,16,8,2,3
top,PEW_BS_1573,CS-103,,Spring,1970-01-01 00:00:00.000002,TC,BS,LHR
freq,48,8900,,203100,76200,94200,376800,190800
mean,,,74.460268,,,,,
std,,,14.415844,,,,,
min,,,50.0,,,,,
25%,,,62.0,,,,,
50%,,,74.0,,,,,
75%,,,87.0,,,,,


## No. of Unique Values in Students Table

In [805]:
df_std.nunique().to_frame().T

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus
0,17102,14673,14270,7380,2,442,2,2,15360,12,2,3


## No. of Unique Values in Course Registration Table

In [806]:
df_reg.nunique().to_frame().T

Unnamed: 0,SID,Course,Score,Semester,Year,Discipline,Degree,Campus
0,11700,120,50,2,16,8,2,3


## No. of Nulls in Students Table

In [797]:
pd.DataFrame(df_std.isnull().sum(axis = 0), columns=['Count']).T

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus
Count,0,0,0,2,3704,0,0,0,0,0,0,0



## No. of Nulls in Course Registration Table

In [796]:
pd.DataFrame(df_reg.isnull().sum(axis = 0), columns=['Count']).T

Unnamed: 0,SID,Course,Score,Semester,Year,Discipline,Degree,Campus
Count,0,0,0,0,0,0,0,0
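Since the task asks for profiling inside the DBMS, the same unique and null counts could also be computed in SQL. The following is a minimal sketch against an in-memory SQLite database; the tiny `Students` table, its rows, and the `profile_column` helper are assumptions made for illustration, not part of the original notebook.

```python
import sqlite3

# Hypothetical miniature staging table; the real Students table has more columns.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Students (SID TEXT, Gender TEXT, DoB TEXT)")
con.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [
        ("KHR_BS_0", "M", "1974-01-05"),
        ("KHR_BS_1", "F", None),
        ("PEW_BS_0", None, "1974-04-14"),
        ("PEW_BS_1", None, "1974-04-14"),
    ],
)


def profile_column(con, table, column):
    """Return (distinct non-null values, null count) for one column."""
    # Identifiers cannot be bound as parameters, so they are quoted inline;
    # only pass trusted table and column names.
    query = (
        f'SELECT COUNT(DISTINCT "{column}"), '
        f'COALESCE(SUM("{column}" IS NULL), 0) FROM "{table}"'
    )
    return con.execute(query).fetchone()


profile = {col: profile_column(con, "Students", col) for col in ("SID", "Gender", "DoB")}
con.close()
print(profile)
```

`COUNT(DISTINCT ...)` ignores nulls, which matches the behaviour of `nunique()` in the cells above.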


## Invalid Values

Let's try to see if the dates in DoB and Reg Date are valid.

In [770]:
dob = pd.to_datetime(df_std.DoB, errors="coerce")

In [468]:
df_std["Reg Date"] = pd.to_datetime(df_std["Reg Date"])

In [469]:
df_reg.Year = pd.to_datetime(df_reg.Year)

### Following are the invalid Date values in DOB

In [771]:
df_std[dob.isnull()]

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus
516,KHR_BS_1462,Annan Satti,Husain Alam Satti,29/Feb/78,M,1996-08-28 00:00:00.000000,A,C,House No.483 St. No.37 Dehri Thata,F.Sc.,BS,KHR
593,KHR_BS_1531,Hadiqa Abuzar,Abuzar Zeeshan,29/Feb/77,F,1996-08-22 00:00:00.000000,A,C,h no. 292 St. No. 38 Koral Umer Kot,HSSC,BS,KHR
1282,KHR_BS_2151,Affan Baig,Mohammad Sheharyar Baig,29/Feb/79,M,1997-08-16 00:00:00.000000,A,C,house no.568 St. No.52 Zeenat block Karachi,F.Sc.,BS,KHR
2113,KHR_BS_290,Qasim Rehan,Rehan Nabeel,29/Feb/75,M,1994-08-29 00:00:00.000000,A,C,Ho. No. 193 St. No. 14 Chand mari Mirpur Khas,A level,BS,KHR
3631,KHR_BS_4266,Shafaq Ahmed,Ahmed Ajab,29/Feb/82,F,2001-08-21 00:00:00.000000,A,C,Ho. No.405 St. No.57 Doley shah Umer Kot,A level,BS,KHR
4167,KHR_BS_4749,Sumiyya Khanzada,Arif Khanzada,29/Feb/83,F,2001-08-03 00:00:00.000000,A,C,house no. 609 St. No. 98 Shah latif town Karachi,F.Sc.,BS,KHR
4319,KHR_BS_4886,Khushbakht Khanzada,Sh. Sarwar Khanzada,29/Feb/83,F,2002-08-16 00:00:00.000000,A,I,h no. 681 St. No. 4 Pwd Coloney KARACHI,HSSC,BS,KHR
4320,KHR_BS_4887,Shabir Hussaini,Syyed Muazzum Hussaini,29/Feb/83,M,2002-08-19 00:00:00.000000,A,I,Ho. No. 996 St. No. 67 Sultana abad Khr,A level,BS,KHR
4933,KHR_BS_5438,Aga Abdus Sammi Khattak,Abdul Qadeer Ikram Khattak,29/Feb/85,M,2003-08-20 00:00:00.000000,A,I,House No. 694 St. No. 13 Cement factory colone...,F.Sc.,BS,KHR
5623,KHR_BS_6059,Arsalan Chohan,Rabbi Daud Chohan,29/Feb/86,M,2004-08-18 00:00:00.000000,A,I,House No.285 St. No.23 Dhamiyal Chishtian,F.Sc.,BS,KHR


Most of these values are 29 Feb in a year that was not a **leap year**. We can correct them by decrementing the day to 28.
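A more general form of "decrementing the day" is to clamp the day to the last valid day of its month, which would also cover a case like 31 Feb. The sketch below assumes the `DD/Mon/YY` format seen in the invalid rows and assumes every two-digit year belongs to the 1900s; `parse_clamped` is a hypothetical helper name.

```python
import calendar
from datetime import date

# Month abbreviations as they appear in the raw DoB strings (e.g. "29/Feb/78").
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def parse_clamped(text, century=1900):
    """Parse 'DD/Mon/YY', clamping the day to the month's last valid day."""
    day, month_name, short_year = text.split("/")
    year = century + int(short_year)  # assumption: all birth years are 19xx
    month = MONTHS[month_name]
    last_day = calendar.monthrange(year, month)[1]  # 28 or 29 for February
    return date(year, month, min(int(day), last_day))


print(parse_clamped("29/Feb/78"))  # clamped, because 1978 is not a leap year
print(parse_clamped("29/Feb/80"))  # unchanged, because 1980 is a leap year
```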

In [772]:
sum(dob.isnull())

16

## No. of Invalid Dates: 16

## Total No. of Courses: 120

In [129]:
df_reg.Course.nunique()

120

## Total no. of Male vs Female students

In [807]:
num_male = sum(df_std.Gender == 'M')
num_female = sum(df_std.Gender == 'F')

In [808]:
pd.DataFrame.from_dict({"Males": num_male, "Females": num_female}, orient="index", columns=['Count']).T

Unnamed: 0,Males,Females
Count,8935,4463


These counts cover only the 13398 rows with a recorded gender. The 3704 rows with a null gender, which include the Peshawar campus, are excluded.

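## Total No. of people who have taken more than 5 courses in a semester: 41200

This figure counts (student, semester, year) combinations, so a student who exceeds 5 courses in several semesters is counted once per semester. It cannot be a count of distinct people, since the registrations table contains only 11700 unique SIDs.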

In [89]:
grouped_semester_course = df_reg.groupby(['SID', 'Semester', 'Year']).agg({'Course': ['count']})
grouped_semester_course.columns = ['count']
grouped_semester_course = grouped_semester_course.reset_index()

In [90]:
sum(grouped_semester_course['count'] > 5)

41200
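The count above is per student and semester. To report distinct students instead, the qualifying groups can be reduced to a set of SIDs. This is a small sketch on made-up registration rows; the sample data and variable names are assumptions.

```python
from collections import Counter

# Hypothetical registrations: (SID, semester, year, course).
registrations = (
    [("S1", "Fall", 1994, f"CS-10{i}") for i in range(6)]      # 6 courses
    + [("S1", "Spring", 1995, f"CS-20{i}") for i in range(6)]  # 6 courses
    + [("S2", "Fall", 1994, f"CS-10{i}") for i in range(5)]    # 5 courses
)

# Number of courses per (student, semester, year).
courses_per_term = Counter((sid, sem, year) for sid, sem, year, _ in registrations)

# Terms in which a student took more than 5 courses.
heavy_terms = [key for key, n in courses_per_term.items() if n > 5]

# Distinct students who did so at least once.
heavy_students = {sid for sid, _, _ in heavy_terms}

print(len(heavy_terms), len(heavy_students))
```

In pandas, the equivalent step would be to take `nunique()` of the `SID` column after filtering the grouped frame on `count > 5`.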

## The relationship between total no. of unique student ID’s and total no. of students



In [810]:
total_std = df_std.SID.count()
# Raw string so that \d is passed to the regex engine unchanged
unq_ids = df_std.SID.str.extract(r'(\d+)').nunique()

In [811]:

pd.DataFrame.from_dict({"Unique IDs": unq_ids[0], "Total Students": total_std}, orient="index", columns=['Count']).T

Unnamed: 0,Unique IDs,Total Students
Count,6601,17102
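The gap between the two numbers comes from the SID structure: the full SID is unique per student (17102 unique values in the profile above), but its numeric part restarts for each campus and degree prefix. A small illustration on a handful of SIDs, where the last entry is a made-up example:

```python
import re

# SIDs in the CAMPUS_DEGREE_NUMBER pattern; "KHR_MS_0" is a hypothetical example.
sids = ["KHR_BS_0", "PEW_BS_0", "KHR_MS_0", "KHR_BS_1", "PEW_BS_1"]

# Numeric part only, as extracted in the cell above.
numeric_parts = {re.search(r"\d+", sid).group() for sid in sids}

print(len(set(sids)), len(numeric_parts))
```

Five distinct students share only two numeric IDs here, so the numeric part alone cannot serve as a key.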


## Average no of people per semester of each campus

In [813]:
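# Note: count() on SID tallies course registrations (rows) per campus and term,
# not distinct students; nunique() would give head counts instead.
grouped_semester_students = df_reg.groupby(['Campus', 'Semester', 'Year']).agg({'SID': ['count']})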
grouped_semester_students.columns = ['count']
grouped_semester_students = grouped_semester_students.reset_index()
grouped_semester_students

Unnamed: 0,Campus,Semester,Year,count
0,KHR,Fall,1994-01-01 00:00:00.000000,3600
1,KHR,Fall,1995-01-01 00:00:00.000000,7200
2,KHR,Fall,1996-01-01 00:00:00.000000,7200
3,KHR,Fall,1997-01-01 00:00:00.000000,7200
4,KHR,Fall,1998-01-01 00:00:00.000000,3600
5,KHR,Fall,2001-01-01 00:00:00.000000,1200
6,KHR,Fall,2002-01-01 00:00:00.000000,2400
7,KHR,Fall,2003-01-01 00:00:00.000000,2400
8,KHR,Fall,2004-01-01 00:00:00.000000,2400
9,KHR,Spring,1995-01-01 00:00:00.000000,3600


In [824]:
avg_semester_students = grouped_semester_students.reset_index().groupby(['Campus']).agg({'count': ['mean']})
avg_semester_students.columns = ['Average# Students']
# avg_semester_students = avg_semester_students.reset_index()
avg_semester_students

Unnamed: 0_level_0,Average# Students
Campus,Unnamed: 1_level_1
KHR,4133.333333
LHR,7950.0
PEW,28200.0
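The averages above are based on registration rows, so a student with several courses in a term is counted several times. If head counts are wanted, each (campus, semester, year) group can be reduced to its distinct SIDs before averaging. A minimal sketch on made-up rows, with all names and data being assumptions:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical registrations: (campus, semester, year, SID), one row per course.
rows = [
    ("KHR", "Fall", 1994, "S1"),
    ("KHR", "Fall", 1994, "S1"),   # same student, second course
    ("KHR", "Fall", 1994, "S2"),
    ("KHR", "Spring", 1995, "S1"),
    ("LHR", "Fall", 1994, "S3"),
    ("LHR", "Fall", 1994, "S4"),
    ("LHR", "Fall", 1994, "S5"),
]

# Distinct students per (campus, semester, year).
students_per_term = defaultdict(set)
for campus, semester, year, sid in rows:
    students_per_term[(campus, semester, year)].add(sid)

# Average term head count per campus.
sizes_by_campus = defaultdict(list)
for (campus, _, _), sids in students_per_term.items():
    sizes_by_campus[campus].append(len(sids))

avg_students = {campus: mean(sizes) for campus, sizes in sizes_by_campus.items()}
print(avg_students)
```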


# Data Cleansing

- Separate first and last names both for the student and father. Standardize the first and last names.

In [694]:
names_list = df_std.Name.str.split().str
df_std["Std_First_Name"] = names_list[0:-1].str.join(' ')
df_std["Std_Last_Name"] = names_list[-1]

Let's look at the data where first name is empty.

In [695]:
df_std[df_std.Std_First_Name == ""]

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name
1037,KHR_BS_1931,Ali,Ali Ejaz,1977-01-12 00:00:00,F,1997-08-23 00:00:00.000000,A,C,house # 354 Street No.71 Odean chowk Multan,FSc,BS,KHR,,Ali
1675,KHR_BS_2505,Namani,Umar Namani,1979-10-11 00:00:00,F,1998-08-12 00:00:00.000000,A,C,h no.680 st. # 86 Kashmir block gulshan iqbal...,HSSC,BS,KHR,,Namani
2435,KHR_BS_319,Niazi,Zubair Niazi,1976-06-08 00:00:00,F,1994-08-08 00:00:00.000000,A,C,House No.404 S# 22 Landhi Karachi,F.Sc.,BS,KHR,,Niazi
2605,KHR_BS_3342,Hussaini,Ismail Munir Hussaini,1979-10-23 00:00:00,F,1999-08-20 00:00:00.000000,A,C,house # 227 st. # 59 Bara qabrustan Usman Kot,FSc,BS,KHR,,Hussaini
4257,KHR_BS_483,Munir,Munir Abdul Mujeeb,1975-04-11 00:00:00,F,1994-08-21 00:00:00.000000,A,C,Ho. no.58 St. # 45 Kamra chowk Karachi,A level,BS,KHR,,Munir
4264,KHR_BS_4836,Jaffery,Aftab Mustansar Jaffery,1982-09-24 00:00:00,F,2002-08-26 00:00:00.000000,A,I,House # 241 street # 54 Cantt. road Sadiqabad,FSc,BS,KHR,,Jaffery
5755,KHR_BS_6178,Muazzam,Muazzam Tahir,1984-07-05 00:00:00,F,2004-08-01 00:00:00.000000,A,I,H# 304 street no.41 Biyaal Sibbi,A-Level,BS,KHR,,Muazzam
6170,KHR_BS_6551,Haq,Mouhammed Wajid Haq,1984-10-13 00:00:00,F,2004-08-01 00:00:00.000000,A,I,h# 16 st. # 25 Shah latif town KHR,A-Level,BS,KHR,,Haq
6879,KHR_MS_1248,Hussaini,Ismail Munir Hussaini,1981-11-21 00:00:00,F,2004-08-26 00:00:00.000000,A,I,h no. 620 s no. 64 Factory road Usman Kot,MS,MS,KHR,,Hussaini
7445,KHR_MS_318,Namani,Umar Namani,1979-09-07 00:00:00,F,2001-08-05 00:00:00.000000,A,C,ho. # 89 street # 26 Bander road khr,MSc,MS,KHR,,Namani


By looking at the data, it seems that Father's name is typed in place of Student's Name by mistake. We can interchange their order, and then try to extract first and last names again.

In [696]:
m = df_std.Std_First_Name == ''
# Swap Name and Father for the affected rows (.values avoids column alignment)
df_std.loc[m, ['Name', 'Father']] = (df_std.loc[m, ['Father', 'Name']].values)
df_std[df_std.Std_First_Name == ""]

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name
1037,KHR_BS_1931,Ali Ejaz,Ali,1977-01-12 00:00:00,F,1997-08-23 00:00:00.000000,A,C,house # 354 Street No.71 Odean chowk Multan,FSc,BS,KHR,,Ali
1675,KHR_BS_2505,Umar Namani,Namani,1979-10-11 00:00:00,F,1998-08-12 00:00:00.000000,A,C,h no.680 st. # 86 Kashmir block gulshan iqbal...,HSSC,BS,KHR,,Namani
2435,KHR_BS_319,Zubair Niazi,Niazi,1976-06-08 00:00:00,F,1994-08-08 00:00:00.000000,A,C,House No.404 S# 22 Landhi Karachi,F.Sc.,BS,KHR,,Niazi
2605,KHR_BS_3342,Ismail Munir Hussaini,Hussaini,1979-10-23 00:00:00,F,1999-08-20 00:00:00.000000,A,C,house # 227 st. # 59 Bara qabrustan Usman Kot,FSc,BS,KHR,,Hussaini
4257,KHR_BS_483,Munir Abdul Mujeeb,Munir,1975-04-11 00:00:00,F,1994-08-21 00:00:00.000000,A,C,Ho. no.58 St. # 45 Kamra chowk Karachi,A level,BS,KHR,,Munir
4264,KHR_BS_4836,Aftab Mustansar Jaffery,Jaffery,1982-09-24 00:00:00,F,2002-08-26 00:00:00.000000,A,I,House # 241 street # 54 Cantt. road Sadiqabad,FSc,BS,KHR,,Jaffery
5755,KHR_BS_6178,Muazzam Tahir,Muazzam,1984-07-05 00:00:00,F,2004-08-01 00:00:00.000000,A,I,H# 304 street no.41 Biyaal Sibbi,A-Level,BS,KHR,,Muazzam
6170,KHR_BS_6551,Mouhammed Wajid Haq,Haq,1984-10-13 00:00:00,F,2004-08-01 00:00:00.000000,A,I,h# 16 st. # 25 Shah latif town KHR,A-Level,BS,KHR,,Haq
6879,KHR_MS_1248,Ismail Munir Hussaini,Hussaini,1981-11-21 00:00:00,F,2004-08-26 00:00:00.000000,A,I,h no. 620 s no. 64 Factory road Usman Kot,MS,MS,KHR,,Hussaini
7445,KHR_MS_318,Umar Namani,Namani,1979-09-07 00:00:00,F,2001-08-05 00:00:00.000000,A,C,ho. # 89 street # 26 Bander road khr,MSc,MS,KHR,,Namani


In [697]:
names_list = df_std.Name.str.split().str
df_std["Std_First_Name"] = names_list[0:-1].str.join(' ')
df_std["Std_Last_Name"] = names_list[-1]

In [698]:
df_std[df_std.Std_First_Name == ""]

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name


### Father's First and Last Name

In [699]:
fnames_list = df_std.Father.str.split().str
df_std["Father_First_Name"] = fnames_list[0:-1].str.join(' ')
df_std["Father_Last_Name"] = fnames_list[-1]

If a Father only has `Last Name`, make it his `First Name`.
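The splitting rule used for both students and fathers, including the single-token case, can be summarised in one small function. This is only a sketch of the rule as described; `split_name` is a hypothetical helper and is not used in the notebook.

```python
def split_name(full_name):
    """Split a full name into (first_name, last_name).

    Everything except the final token is treated as the first name.
    A single-token name is kept as the first name, with an empty last name.
    """
    tokens = full_name.split()
    if not tokens:
        return "", ""
    if len(tokens) == 1:
        return tokens[0], ""
    return " ".join(tokens[:-1]), tokens[-1]


print(split_name("Viqar Hamid Hussaini"))
print(split_name("Ali"))
```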

In [700]:
df_std[df_std.Father_First_Name == '']

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name,Father_First_Name,Father_Last_Name
1037,KHR_BS_1931,Ali Ejaz,Ali,1977-01-12 00:00:00,F,1997-08-23 00:00:00.000000,A,C,house # 354 Street No.71 Odean chowk Multan,FSc,BS,KHR,Ali,Ejaz,,Ali
1675,KHR_BS_2505,Umar Namani,Namani,1979-10-11 00:00:00,F,1998-08-12 00:00:00.000000,A,C,h no.680 st. # 86 Kashmir block gulshan iqbal...,HSSC,BS,KHR,Umar,Namani,,Namani
2435,KHR_BS_319,Zubair Niazi,Niazi,1976-06-08 00:00:00,F,1994-08-08 00:00:00.000000,A,C,House No.404 S# 22 Landhi Karachi,F.Sc.,BS,KHR,Zubair,Niazi,,Niazi
2605,KHR_BS_3342,Ismail Munir Hussaini,Hussaini,1979-10-23 00:00:00,F,1999-08-20 00:00:00.000000,A,C,house # 227 st. # 59 Bara qabrustan Usman Kot,FSc,BS,KHR,Ismail Munir,Hussaini,,Hussaini
4257,KHR_BS_483,Munir Abdul Mujeeb,Munir,1975-04-11 00:00:00,F,1994-08-21 00:00:00.000000,A,C,Ho. no.58 St. # 45 Kamra chowk Karachi,A level,BS,KHR,Munir Abdul,Mujeeb,,Munir
4264,KHR_BS_4836,Aftab Mustansar Jaffery,Jaffery,1982-09-24 00:00:00,F,2002-08-26 00:00:00.000000,A,I,House # 241 street # 54 Cantt. road Sadiqabad,FSc,BS,KHR,Aftab Mustansar,Jaffery,,Jaffery
5755,KHR_BS_6178,Muazzam Tahir,Muazzam,1984-07-05 00:00:00,F,2004-08-01 00:00:00.000000,A,I,H# 304 street no.41 Biyaal Sibbi,A-Level,BS,KHR,Muazzam,Tahir,,Muazzam
6170,KHR_BS_6551,Mouhammed Wajid Haq,Haq,1984-10-13 00:00:00,F,2004-08-01 00:00:00.000000,A,I,h# 16 st. # 25 Shah latif town KHR,A-Level,BS,KHR,Mouhammed Wajid,Haq,,Haq
6879,KHR_MS_1248,Ismail Munir Hussaini,Hussaini,1981-11-21 00:00:00,F,2004-08-26 00:00:00.000000,A,I,h no. 620 s no. 64 Factory road Usman Kot,MS,MS,KHR,Ismail Munir,Hussaini,,Hussaini
7445,KHR_MS_318,Umar Namani,Namani,1979-09-07 00:00:00,F,2001-08-05 00:00:00.000000,A,C,ho. # 89 street # 26 Bander road khr,MSc,MS,KHR,Umar,Namani,,Namani


In [701]:
m = df_std.Father_First_Name == ''
df_std.loc[m, ['Father_First_Name', 'Father_Last_Name']] = (df_std.loc[m, ['Father_Last_Name', 'Father_First_Name']].values)
df_std[df_std.Father_First_Name == ""]

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name,Father_First_Name,Father_Last_Name


### Name Standardization

In [702]:
all_names = pd.concat([df_std.Std_First_Name, df_std.Std_Last_Name, df_std.Father_First_Name, df_std.Father_Last_Name], ignore_index=True)

In [703]:
len(all_names.unique())

12475

In [709]:
std_name_map = {}

std_name_map.update({"AAdil": 'Aadil'})
std_name_map.update({"AAmer": 'Aamer'})
std_name_map.update({"AAmir": 'Aamir'})
std_name_map.update({"AAqil": 'Aaqil'})
std_name_map.update({"AAmina": 'Aamina'})
# Raw strings keep the regex escape \. (a literal dot) intact
std_name_map.update({r"Ch\.": 'Choudhary', "Chodhary": "Choudhary"})
std_name_map.update({r"M\.": 'Mohammad', "Mohammed": "Mohammad", 
                     r"Momd\.": "Mohammad", "Mouhammad": "Mohammad", 
                     "Mouhammed": "Mohammad", "Muhammad": "Mohammad"})
std_name_map.update({r"Sh\.": "Sheikh", "Sheik": 'Sheikh', r'Shk\.': 'Sheikh'})
std_name_map.update({"Syed": "Syyed"})

std_name_map

{'AAdil': 'Aadil',
 'AAmer': 'Aamer',
 'AAmir': 'Aamir',
 'AAqil': 'Aaqil',
 'AAmina': 'Aamina',
 'Ch\\.': 'Choudhary',
 'Chodhary': 'Choudhary',
 'M\\.': 'Mohammad',
 'Mohammed': 'Mohammad',
 'Momd\\.': 'Mohammad',
 'Mouhammad': 'Mohammad',
 'Mouhammed': 'Mohammad',
 'Muhammad': 'Mohammad',
 'Sh\\.': 'Sheikh',
 'Sheik': 'Sheikh',
 'Shk\\.': 'Sheikh',
 'Syed': 'Syyed'}

In [710]:
df_std.Std_First_Name = df_std.Std_First_Name.replace(std_name_map, regex=True)
# sorted(df_std.Std_First_Name.unique())

In [711]:
df_std.Std_Last_Name = df_std.Std_Last_Name.replace(std_name_map, regex=True)
df_std.Name = df_std.Name.replace(std_name_map, regex=True)
df_std.Father = df_std.Father.replace(std_name_map, regex=True)
df_std.Father_First_Name = df_std.Father_First_Name.replace(std_name_map, regex=True)
df_std.Father_Last_Name = df_std.Father_Last_Name.replace(std_name_map, regex=True)
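The assignment hint suggests keeping the variations in a lookup table rather than in code. As a hedged sketch of that alternative, the same mapping could live in a two-column table and be applied with a single `UPDATE`. The table names, the `Name_Part` column, and the sample rows below are assumptions, and the match is on whole values only, so multi-word first names would need to be split into tokens first.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE name_lookup (variation TEXT PRIMARY KEY, correct_name TEXT NOT NULL);
    CREATE TABLE Students (SID TEXT PRIMARY KEY, Name_Part TEXT);
    """
)

# Variations taken from the mapping above; the table layout is hypothetical.
con.executemany(
    "INSERT INTO name_lookup VALUES (?, ?)",
    [("Mohammed", "Mohammad"), ("Muhammad", "Mohammad"), ("Syed", "Syyed")],
)
con.executemany(
    "INSERT INTO Students VALUES (?, ?)",
    [("KHR_BS_0", "Muhammad"), ("KHR_BS_1", "Syed"), ("KHR_BS_2", "Ansary")],
)

# Replace every value that has an entry in the lookup table; no names are
# hardcoded in the statement itself.
con.execute(
    """
    UPDATE Students
    SET Name_Part = (
        SELECT correct_name FROM name_lookup
        WHERE variation = Students.Name_Part
    )
    WHERE Name_Part IN (SELECT variation FROM name_lookup)
    """
)

names = [row[0] for row in con.execute("SELECT Name_Part FROM Students ORDER BY SID")]
con.close()
print(names)
```

New variations can then be handled by inserting rows into `name_lookup`, without touching the update statement.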

---

- Since the gender information is missing for the Peshawar campus, you can use the student names to figure out the gender. 

### Our Approach
We created a simple LSTM deep-learning model, trained on students' first names from the Karachi and Lahore campuses, that predicts whether a student is male or female based on the name.

First we'll create a training dataset for the model from the Lahore and Karachi campuses, and then use the Peshawar campus for prediction.
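For comparison, the lookup-table approach hinted at in the assignment can be sketched without a model: count the genders seen for each first token in the labelled campuses and take the majority. This is not the LSTM used here, only a simpler baseline; the sample names, `gender_lookup`, and `guess_gender` are assumptions. Names never seen in the labelled data stay unresolved, and leading titles such as "Sheikh" would need to be skipped in practice.

```python
from collections import Counter, defaultdict

# Hypothetical labelled sample: (first name, gender) from campuses with gender data.
labeled = [
    ("Hussain", "M"),
    ("Hamna", "F"),
    ("Hamna", "F"),
    ("Kishwar", "F"),
    ("Kishwar", "M"),
    ("Kishwar", "F"),
]

# Tally genders per lower-cased first token.
votes = defaultdict(Counter)
for name, gender in labeled:
    votes[name.split()[0].lower()][gender] += 1

# Majority gender for each known first token.
gender_lookup = {token: counts.most_common(1)[0][0] for token, counts in votes.items()}


def guess_gender(first_name):
    """Return 'M' or 'F' from the lookup, or None if the name is unknown."""
    tokens = first_name.split()
    if not tokens:
        return None
    return gender_lookup.get(tokens[0].lower())


print(guess_gender("Hamna"), guess_gender("Kishwar"), guess_gender("Natalia"))
```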

In [712]:
csv_data = df_std[['Std_First_Name', 'Gender']][~df_std.Gender.isnull()]
csv_data

Unnamed: 0,Std_First_Name,Gender
0,Hussain,M
1,Sheikh Munir,M
2,Hamna,F
3,Jabbar,M
4,Rana Hayyat,M
...,...,...
13397,Rohi,F
13398,Mahtab Emmad,M
13399,Mohammad Latif,M
13400,Iman,F


In [713]:
csv_data.to_csv(data_dir / 'names.csv', index=False)

In [714]:
null_gender = df_std[['Std_First_Name']][df_std.Gender.isnull()]
null_gender

Unnamed: 0,Std_First_Name
11982,Karam
12397,Abdul Ahad
12972,Mehak
12991,Mohammad Abdur Rahim
13402,Midhat Azhar
...,...
17097,Natalia
17098,Agha Abdul kareem
17099,Azhar
17100,Riffat Ajmal Affan


In [715]:
pd.Series(null_gender.Std_First_Name.unique()).to_csv(data_dir / 'names_unlabeled.csv', index=False, header=False)

In [716]:
sum(null_gender.Std_First_Name == "")

0

### Load predicted genders

In [724]:
import json

with open(data_dir / 'genders.json') as json_file:
    pew_gender_map = json.load(json_file)

In [725]:
gender_values = null_gender.Std_First_Name.map(pew_gender_map)

In [727]:
df_std.loc[df_std.Gender.isnull(), 'Gender'] = gender_values

In [728]:
df_std[df_std.Campus == 'PEW']

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name,Father_First_Name,Father_Last_Name
13402,PEW_BS_0,Midhat Azhar Durrani,Azhar Durrani,1974-04-14 00:00:00.000000,F,1994-08-26 00:00:00.000000,A,C,H# 447 St. # 5 Air port coloney PSH,A-Level,BS,PEW,Midhat Azhar,Durrani,Azhar,Durrani
13403,PEW_BS_1,Mohammad Abdul Rafay Kazmi,Iffan Kazmi,1974-12-13 00:00:00.000000,M,1994-08-13 00:00:00.000000,A,C,H# 169 s# 0 University coloney PSH,A-Level,BS,PEW,Mohammad Abdul Rafay,Kazmi,Iffan,Kazmi
13404,PEW_BS_10,Ghosia Najum,Najum Yosaf,1974-04-25 00:00:00.000000,F,1994-08-14 00:00:00.000000,A,C,h# 697 St. # 94 Lohi pura Hasanabdal,A-Level,BS,PEW,Ghosia,Najum,Najum,Yosaf
13405,PEW_BS_100,Shuja Taimor Kharasani,Sahir Kharasani,1974-05-22 00:00:00.000000,M,1994-08-18 00:00:00.000000,A,C,H# 504 S No.4 Bara market PSH,A-Level,BS,PEW,Shuja Taimor,Kharasani,Sahir,Kharasani
13406,PEW_BS_1000,Kashif Zaidi,Zaran Zaidi,1978-02-25 00:00:00.000000,M,1997-08-28 00:00:00.000000,A,C,H No.509 St. No.38 University coloney PESHAWAR,HSSC,BS,PEW,Kashif,Zaidi,Zaran,Zaidi
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17097,PEW_MS_95,Natalia Baqir,Baqir Abdul Rafay,1979-03-15 00:00:00.000000,F,2001-01-08 00:00:00.000000,A,C,House No. 699 Street # 59 Chand mari Gilgit,BS,MS,PEW,Natalia,Baqir,Baqir Abdul,Rafay
17098,PEW_MS_96,Agha Abdul kareem Abbasi,Ghazi Abbasi,1978-11-15 00:00:00.000000,M,2001-05-08 00:00:00.000000,A,C,H No. 14 s no. 55 Bara market PESHAWAR,MS,MS,PEW,Agha Abdul kareem,Abbasi,Ghazi,Abbasi
17099,PEW_MS_97,Azhar Durrani,Ibraheem Durrani,1978-11-17 00:00:00.000000,M,2001-08-08 00:00:00.000000,A,C,Ho. no. 329 s no. 50 Khybar road Psh,M.Sc,MS,PEW,Azhar,Durrani,Ibraheem,Durrani
17100,PEW_MS_98,Riffat Ajmal Affan Ansary,Ajmal Affan Ansary,1978-08-20 00:00:00.000000,F,2001-11-08 00:00:00.000000,A,C,h no.875 St. no.45 Purani Taal Noshera,MS,MS,PEW,Riffat Ajmal Affan,Ansary,Ajmal Affan,Ansary


In [729]:
sum(df_std.Gender.isnull())

0

### Each student has been assigned a gender.

---

- Introduce a new column i.e. city_name. It will involve extracting the city from the address and then the standardization of the city names. (Use the city names in the telephone directory as standard)

In [730]:
city_names = df_std.Address.str.split().str[-1]

In [731]:
last_two_city_names = df_std.Address.str.split().str[-2:].str.join(' ')
# sorted(last_two_city_names.unique())

In [732]:
sorted(city_names.unique())

['Abbotabad',
 'Attock',
 'Bahawalpur',
 'Bhatian',
 'Bhawalpur',
 'Chishtian',
 'Faisalabad',
 'Gilgit',
 'Gujrat',
 'Gujrawala',
 'Hakim',
 'Haripur',
 'Hasanabdal',
 'Hyderabad',
 'Jehlum',
 'KARACHI',
 'KHR',
 'Kabirwala',
 'Karachi',
 'Khan',
 'Kharian',
 'Khas',
 'Khr',
 'Kot',
 'LalaMusa',
 'Lyyah',
 'Mansehra',
 'Mardan',
 'Melsi',
 'Multan',
 'Noshera',
 'PESHAWAR',
 'PSH',
 'Peshawar',
 'Psh',
 'Quetta',
 'Sadiqabad',
 'Sahiwal',
 'Sargodha',
 'Sawat',
 'Shekhopura',
 'Sialkot',
 'Sibbi',
 'Thata',
 'Wahari',
 'Wazirabad',
 'karachi',
 'khan',
 'khr',
 'peshawar',
 'psh']

In [733]:
std_city_map = {}
std_city_map.update(
    dict.fromkeys(
        [
            'KHR', 
            'KARACHI', 
            'Khr', 
            'Karachi', 
            'khr', 
            'karachi'
            ], 'Karachi'))

std_city_map.update(
    dict.fromkeys(
        [
            'PSH',
            'Psh',
            'Peshawar',
            'PESHAWAR',
            'peshawar',
            'psh'
        ], 'Peshawar'
    )
)

std_city_map.update({'Abbotabad': 'Abottabad'})
std_city_map.update({'Bahawalpur': 'Bahawalpur', 'Bhawalpur': 'Bahawalpur'})
std_city_map.update({'Bhatian': 'Pindi Bhattian'})
std_city_map.update({'D.G. Khan': 'Dera Ghazi Khan', 'ghazi khan': 'Dera Ghazi Khan'})
std_city_map.update({'D.I. Khan': 'D.I.Khan', 'Ismail Khan': 'D.I.Khan'})
std_city_map.update({'Gujrawala': 'Gujranwala'})
std_city_map.update({'Hakim': 'Abdul Hakim'})
std_city_map.update({'Hasanabdal': 'Hasan Abdal'})
std_city_map.update({'Jehlum': 'Jhelum'})
std_city_map.update({'Khas': 'Mirpur Khas'})
std_city_map.update({'Lyyah': 'Layyah'})
std_city_map.update({'Melsi': 'Mailsi'})
std_city_map.update({'Noshera': 'Nowshera'})
std_city_map.update({'Sawat': 'Swat'})
std_city_map.update({'Shekhopura': 'Sheikhupura'})
std_city_map.update({'Sibbi': 'Sibi'})
std_city_map.update({'Thata': 'Thatta'})
std_city_map.update({'Wahari': 'Vehari'})

In [735]:
two_words_city_name = ['Khan', 'Kot']

In [736]:
# Get the previous word for these, in order to figure out the city
city_names[city_names == 'Khan'] = last_two_city_names[city_names == 'Khan']
city_names[city_names == 'khan'] = last_two_city_names[city_names == 'khan']
city_names[city_names == 'Kot'] = last_two_city_names[city_names == 'Kot']

In [738]:
std_city_names = city_names.replace(std_city_map)
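The extraction and standardization steps above can be combined into one function that first decides whether the city spans two words and then looks the result up case-insensitively. The sketch below uses only a small subset of the mapping, and `CITY_MAP`, `TWO_WORD_ENDINGS`, and `extract_city` are hypothetical names.

```python
# Subset of the standardization map, keyed by lower-cased variation.
CITY_MAP = {
    "khr": "Karachi",
    "karachi": "Karachi",
    "psh": "Peshawar",
    "peshawar": "Peshawar",
    "thata": "Thatta",
    "khas": "Mirpur Khas",
    "d.g. khan": "Dera Ghazi Khan",
}

# Final tokens that signal a two-word city name (e.g. "Umer Kot").
TWO_WORD_ENDINGS = {"khan", "kot"}


def extract_city(address):
    """Return the standardized city name taken from the end of an address."""
    tokens = address.split()
    if not tokens:
        return None
    if tokens[-1].lower() in TWO_WORD_ENDINGS and len(tokens) >= 2:
        key = " ".join(tokens[-2:])
    else:
        key = tokens[-1]
    # Fall back to the extracted text when no variation is registered.
    return CITY_MAP.get(key.lower(), key)


print(extract_city("h# 978 Street No.72 Defense Phase 1 KHR"))
print(extract_city("h no. 292 St. No. 38 Koral Umer Kot"))
```

Lower-casing the key means the six spellings of Karachi and of Peshawar need only two entries each.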

In [739]:
sorted(std_city_names.unique())

['Abdul Hakim',
 'Abottabad',
 'Attock',
 'Bahawalpur',
 'Chishtian',
 'D.I.Khan',
 'Dera Ghazi Khan',
 'Faisalabad',
 'Gilgit',
 'Gujranwala',
 'Gujrat',
 'Haripur',
 'Hasan Abdal',
 'Hyderabad',
 'Jhelum',
 'Kabirwala',
 'Karachi',
 'Kharian',
 'LalaMusa',
 'Layyah',
 'Mailsi',
 'Mansehra',
 'Mardan',
 'Mirpur Khas',
 'Multan',
 'Nowshera',
 'Peshawar',
 'Pindi Bhattian',
 'Quetta',
 'Sadiqabad',
 'Sahiwal',
 'Sargodha',
 'Sheikhupura',
 'Sialkot',
 'Sibi',
 'Swat',
 'Thatta',
 'Umer Kot',
 'Usman Kot',
 'Vehari',
 'Wazirabad']

In [740]:
df_std['City'] = std_city_names

In [741]:
df_std.head()

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name,Father_First_Name,Father_Last_Name,City
0,KHR_BS_0,Hussain Ansary,Ubaid Ansary,1974-01-05 00:00:00,M,1994-08-13 00:00:00.000000,A,C,h# 978 Street No.72 Defense Phase 1 KHR,A-Level,BS,KHR,Hussain,Ansary,Ubaid,Ansary,Karachi
1,KHR_BS_1,Sheikh Munir Hussaini,Viqar Hamid Hussaini,1974-12-13 00:00:00,M,1994-08-13 00:00:00.000000,A,C,H# 169 s# 0 Mutian wala Thata,A-Level,BS,KHR,Sheikh Munir,Hussaini,Viqar Hamid,Hussaini,Thatta
2,KHR_BS_10,Hamna Ansary,Hameed Ansary,1974-04-25 00:00:00,F,1994-08-14 00:00:00.000000,A,C,h# 697 St. # 94 sea site KHR,A-Level,BS,KHR,Hamna,Ansary,Hameed,Ansary,Karachi
3,KHR_BS_100,Jabbar Haqqie,Muneer Rai Haqqie,1974-05-22 00:00:00,M,1994-08-18 00:00:00.000000,A,C,H# 504 S No.4 Kumria Quetta,A-Level,BS,KHR,Jabbar,Haqqie,Muneer Rai,Haqqie,Quetta
4,KHR_BS_1000,Rana Hayyat Baig,Ghulam Mustafa Baig,1976-02-25 00:00:00,M,1995-08-28 00:00:00.000000,A,C,H No.509 St. No.38 Kishwar heights university...,HSSC,BS,KHR,Rana Hayyat,Baig,Ghulam Mustafa,Baig,Karachi


In [742]:
df_std.describe()

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name,Father_First_Name,Father_Last_Name,City
count,17102,17102,17102,17100,17102,17102,17102,17102,17102,17102,17102,17102,17102,17102,17102,17102,17102
unique,17102,14642,14230,7380,2,442,2,2,15360,12,2,3,5879,315,6458,316,41
top,PEW_BS_1573,Kishwar Pasha,Amjad Satti,1981-03-22 00:00:00.000000,M,2003-08-16 00:00:00.000000,A,C,house no.141 S# 16 Haider road Multan,F.Sc.,BS,KHR,Firdous,Zaidi,Amjad,Zaidi,Karachi
freq,1,6,5,8,11531,76,17100,12500,3,2390,14302,8201,55,369,62,369,4438


### We have 41 Unique cities

---

- Bring the gender information into a consistent representation. Use ‘M’ and ‘F’ with data type CHAR(1).


We've already achieved this during the data collection phase.

## Date Validation

- If you have been using VARCHAR or CHAR for date, port the date information into columns with the proper Date data type. The dates should be as per calendar dates.

In [743]:
df_std["Reg Date"] = pd.to_datetime(df_std["Reg Date"])

In [744]:
df_reg.Year = pd.to_datetime(df_reg.Year)

In [745]:
df_std.DoB = pd.to_datetime(df_std.DoB)

ParserError: day is out of range for month: 29/Feb/78

Converting `DoB` to a datetime raises an error because of invalid dates, e.g. 29 Feb in a non-leap year (`29/Feb/78`). We'll fix this in the next section.

---

- Validate all the dates against the business rules, e.g. the DOB should be earlier than the Reg. Date, and the Graduation Date should be later than the Reg. Date. The data may contain anomalies such as a DOB and Reg. Date exchanged by mistake. Also be careful with invalid dates, e.g. 31st Feb. or 29th Feb. in a non-leap year.

First, let's extract invalid `DoB` dates.

In [746]:
dob = pd.to_datetime(df_std.DoB, errors="coerce")

In [747]:
df_std[dob.isnull()]

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name,Father_First_Name,Father_Last_Name,City
516,KHR_BS_1462,Annan Satti,Husain Alam Satti,29/Feb/78,M,1996-08-28,A,C,House No.483 St. No.37 Dehri Thata,F.Sc.,BS,KHR,Annan,Satti,Husain Alam,Satti,Thatta
593,KHR_BS_1531,Hadiqa Abuzar,Abuzar Zeeshan,29/Feb/77,F,1996-08-22,A,C,h no. 292 St. No. 38 Koral Umer Kot,HSSC,BS,KHR,Hadiqa,Abuzar,Abuzar,Zeeshan,Umer Kot
1282,KHR_BS_2151,Affan Baig,Mohammad Sheharyar Baig,29/Feb/79,M,1997-08-16,A,C,house no.568 St. No.52 Zeenat block Karachi,F.Sc.,BS,KHR,Affan,Baig,Mohammad Sheharyar,Baig,Karachi
2113,KHR_BS_290,Qasim Rehan,Rehan Nabeel,29/Feb/75,M,1994-08-29,A,C,Ho. No. 193 St. No. 14 Chand mari Mirpur Khas,A level,BS,KHR,Qasim,Rehan,Rehan,Nabeel,Mirpur Khas
3631,KHR_BS_4266,Shafaq Ahmed,Ahmed Ajab,29/Feb/82,F,2001-08-21,A,C,Ho. No.405 St. No.57 Doley shah Umer Kot,A level,BS,KHR,Shafaq,Ahmed,Ahmed,Ajab,Umer Kot
4167,KHR_BS_4749,Sumiyya Khanzada,Arif Khanzada,29/Feb/83,F,2001-08-03,A,C,house no. 609 St. No. 98 Shah latif town Karachi,F.Sc.,BS,KHR,Sumiyya,Khanzada,Arif,Khanzada,Karachi
4319,KHR_BS_4886,Khushbakht Khanzada,Sheikh Sarwar Khanzada,29/Feb/83,F,2002-08-16,A,I,h no. 681 St. No. 4 Pwd Coloney KARACHI,HSSC,BS,KHR,Khushbakht,Khanzada,Sheikh Sarwar,Khanzada,Karachi
4320,KHR_BS_4887,Shabir Hussaini,Syyed Muazzum Hussaini,29/Feb/83,M,2002-08-19,A,I,Ho. No. 996 St. No. 67 Sultana abad Khr,A level,BS,KHR,Shabir,Hussaini,Syyed Muazzum,Hussaini,Karachi
4933,KHR_BS_5438,Aga Abdus Sammi Khattak,Abdul Qadeer Ikram Khattak,29/Feb/85,M,2003-08-20,A,I,House No. 694 St. No. 13 Cement factory colone...,F.Sc.,BS,KHR,Aga Abdus Sammi,Khattak,Abdul Qadeer Ikram,Khattak,Chishtian
5623,KHR_BS_6059,Arsalan Chohan,Rabbi Daud Chohan,29/Feb/86,M,2004-08-18,A,I,House No.285 St. No.23 Dhamiyal Chishtian,F.Sc.,BS,KHR,Arsalan,Chohan,Rabbi Daud,Chohan,Chishtian


Apart from the last two rows, which have missing dates, every invalid date is `29 Feb` in a non-leap year. We can fix these by changing the day to `28`.

In [748]:
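# Match the whole '29/Feb' prefix so that no other '29' in the string is touched
df_std.loc[dob.isnull(), 'DoB'] = df_std.loc[dob.isnull(), 'DoB'].str.replace('29/Feb', '28/Feb', regex=False)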

In [749]:
df_std[dob.isnull()]

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name,Father_First_Name,Father_Last_Name,City
516,KHR_BS_1462,Annan Satti,Husain Alam Satti,28/Feb/78,M,1996-08-28,A,C,House No.483 St. No.37 Dehri Thata,F.Sc.,BS,KHR,Annan,Satti,Husain Alam,Satti,Thatta
593,KHR_BS_1531,Hadiqa Abuzar,Abuzar Zeeshan,28/Feb/77,F,1996-08-22,A,C,h no. 292 St. No. 38 Koral Umer Kot,HSSC,BS,KHR,Hadiqa,Abuzar,Abuzar,Zeeshan,Umer Kot
1282,KHR_BS_2151,Affan Baig,Mohammad Sheharyar Baig,28/Feb/79,M,1997-08-16,A,C,house no.568 St. No.52 Zeenat block Karachi,F.Sc.,BS,KHR,Affan,Baig,Mohammad Sheharyar,Baig,Karachi
2113,KHR_BS_290,Qasim Rehan,Rehan Nabeel,28/Feb/75,M,1994-08-29,A,C,Ho. No. 193 St. No. 14 Chand mari Mirpur Khas,A level,BS,KHR,Qasim,Rehan,Rehan,Nabeel,Mirpur Khas
3631,KHR_BS_4266,Shafaq Ahmed,Ahmed Ajab,28/Feb/82,F,2001-08-21,A,C,Ho. No.405 St. No.57 Doley shah Umer Kot,A level,BS,KHR,Shafaq,Ahmed,Ahmed,Ajab,Umer Kot
4167,KHR_BS_4749,Sumiyya Khanzada,Arif Khanzada,28/Feb/83,F,2001-08-03,A,C,house no. 609 St. No. 98 Shah latif town Karachi,F.Sc.,BS,KHR,Sumiyya,Khanzada,Arif,Khanzada,Karachi
4319,KHR_BS_4886,Khushbakht Khanzada,Sheikh Sarwar Khanzada,28/Feb/83,F,2002-08-16,A,I,h no. 681 St. No. 4 Pwd Coloney KARACHI,HSSC,BS,KHR,Khushbakht,Khanzada,Sheikh Sarwar,Khanzada,Karachi
4320,KHR_BS_4887,Shabir Hussaini,Syyed Muazzum Hussaini,28/Feb/83,M,2002-08-19,A,I,Ho. No. 996 St. No. 67 Sultana abad Khr,A level,BS,KHR,Shabir,Hussaini,Syyed Muazzum,Hussaini,Karachi
4933,KHR_BS_5438,Aga Abdus Sammi Khattak,Abdul Qadeer Ikram Khattak,28/Feb/85,M,2003-08-20,A,I,House No. 694 St. No. 13 Cement factory colone...,F.Sc.,BS,KHR,Aga Abdus Sammi,Khattak,Abdul Qadeer Ikram,Khattak,Chishtian
5623,KHR_BS_6059,Arsalan Chohan,Rabbi Daud Chohan,28/Feb/86,M,2004-08-18,A,I,House No.285 St. No.23 Dhamiyal Chishtian,F.Sc.,BS,KHR,Arsalan,Chohan,Rabbi Daud,Chohan,Chishtian


In [750]:
dob = pd.to_datetime(df_std.DoB, errors="coerce")
df_std[dob.isnull()]

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name,Father_First_Name,Father_Last_Name,City
10649,LHR_BS_3200,Choudhary Abdul Qadir Minhas,Viqar Minhas,,M,2002-08-20,A,I,House No.117 St. no.41 Malik abad Bahawalpur,F.Sc.,BS,LHR,Choudhary Abdul Qadir,Minhas,Viqar,Minhas,Bahawalpur
11692,LHR_BS_414,Pervaiz Sikandar Sherazi,Abdul Mannan Shabeer Sherazi,,M,1995-08-28,A,C,h no.249 St. no.77 Ramzania Wazirabad,HSSC,BS,LHR,Pervaiz Sikandar,Sherazi,Abdul Mannan Shabeer,Sherazi,Wazirabad


To fill in the missing dates, we estimate each one from the `Reg Date`: we take the median DoB of all students who registered within five days of that date.
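
The same idea can be sketched on toy data. The frame `toy` and the helper `median_dob_near` are hypothetical names used only for illustration, and the sketch assumes a pandas version in which `Series.median` supports datetime columns. Unlike a numeric conversion, this variant skips missing DoB values when taking the median.

```python
import pandas as pd

# Hypothetical toy data; the real frame in the notebook is df_std.
toy = pd.DataFrame({
    "DoB": pd.to_datetime(["1976-01-10", "1976-03-10", None,
                           "1976-05-10", "1980-01-01"]),
    "Reg Date": pd.to_datetime(["1995-08-26", "1995-08-27", "1995-08-28",
                                "1995-08-30", "1999-08-28"]),
})

def median_dob_near(frame, reg_date, days=5):
    """Median DoB of the rows registered within `days` days of reg_date."""
    centre = pd.Timestamp(reg_date)
    in_window = frame["Reg Date"].between(centre - pd.Timedelta(days=days),
                                          centre + pd.Timedelta(days=days))
    # Series.median skips NaT by default, so missing DoBs do not affect the result.
    return frame.loc[in_window, "DoB"].median()

estimate = median_dob_near(toy, "1995-08-28")
print(estimate)
```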

In [751]:
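def get_median_dob(reg_date):
    """Return the median DoB of students registered within 5 days of reg_date."""
    # Registration window: 5 days either side of the given date
    a = pd.Timestamp(reg_date) - pd.Timedelta(5, unit='d')
    b = pd.Timestamp(reg_date) + pd.Timedelta(5, unit='d')
    # Convert the DoBs in the window to integer nanoseconds so a median can be taken
    df2_numeric_format = pd.to_numeric(pd.to_datetime(df_std[df_std["Reg Date"].between(a, b)].DoB))
    median_numeric_format = np.median(df2_numeric_format)
    # Convert the numeric median back to a timestamp
    median_datetime_format = pd.to_datetime(median_numeric_format)
    return median_datetime_format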

In [752]:
get_median_dob("1995-08-28")

Timestamp('1976-08-06 00:00:00')

In [753]:
invalid_dates = df_std.loc[dob.isnull()][['DoB', 'Reg Date']]
invalid_dates.DoB = invalid_dates['Reg Date'].apply(get_median_dob)
invalid_dates

Unnamed: 0,DoB,Reg Date
10649,1982-07-25 12:00:00,2002-08-20
11692,1976-08-06 00:00:00,1995-08-28


In [754]:
df_std.loc[dob.isnull(), 'DoB'] = df_std.loc[dob.isnull(), 'Reg Date'].apply(get_median_dob)

In [755]:
dob = pd.to_datetime(df_std.DoB, errors="coerce")
df_std[dob.isnull()]

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name,Father_First_Name,Father_Last_Name,City


Now that all dates are valid, we convert the `DoB` column to a datetime data type.

In [756]:
df_std.DoB = pd.to_datetime(df_std.DoB)
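
As a standard-library illustration of how a value such as `28/Feb/78` maps to a calendar date, and why `29/Feb/78` does not, here is a sketch with an explicit format string. The format `%d/%b/%y` is an assumption about the layout, not something the notebook passes to pandas, and `%b` assumes an English locale.

```python
from datetime import datetime

DOB_FORMAT = "%d/%b/%y"  # assumed layout: day / abbreviated month / two-digit year

parsed = datetime.strptime("28/Feb/78", DOB_FORMAT).date()
print(parsed.isoformat())  # 1978-02-28

try:
    datetime.strptime("29/Feb/78", DOB_FORMAT)
    rejected = False
except ValueError:
    rejected = True  # 1978 is not a leap year, so this day does not exist
```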

In [757]:
df_std.head()

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name,Father_First_Name,Father_Last_Name,City
0,KHR_BS_0,Hussain Ansary,Ubaid Ansary,1974-01-05,M,1994-08-13,A,C,h# 978 Street No.72 Defense Phase 1 KHR,A-Level,BS,KHR,Hussain,Ansary,Ubaid,Ansary,Karachi
1,KHR_BS_1,Sheikh Munir Hussaini,Viqar Hamid Hussaini,1974-12-13,M,1994-08-13,A,C,H# 169 s# 0 Mutian wala Thata,A-Level,BS,KHR,Sheikh Munir,Hussaini,Viqar Hamid,Hussaini,Thatta
2,KHR_BS_10,Hamna Ansary,Hameed Ansary,1974-04-25,F,1994-08-14,A,C,h# 697 St. # 94 sea site KHR,A-Level,BS,KHR,Hamna,Ansary,Hameed,Ansary,Karachi
3,KHR_BS_100,Jabbar Haqqie,Muneer Rai Haqqie,1974-05-22,M,1994-08-18,A,C,H# 504 S No.4 Kumria Quetta,A-Level,BS,KHR,Jabbar,Haqqie,Muneer Rai,Haqqie,Quetta
4,KHR_BS_1000,Rana Hayyat Baig,Ghulam Mustafa Baig,1976-02-25,M,1995-08-28,A,C,H No.509 St. No.38 Kishwar heights university...,HSSC,BS,KHR,Rana Hayyat,Baig,Ghulam Mustafa,Baig,Karachi


### DoB should be earlier than Reg Date

In [758]:
df_std[df_std.DoB > df_std['Reg Date']]

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name,Father_First_Name,Father_Last_Name,City
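
The empty result means no student has a DoB later than the registration date. A stricter variant of this rule, sketched here on hypothetical toy data, could also flag implausibly young ages at registration. The 15-year threshold and the frame `toy` are assumptions made for illustration, not rules taken from the assignment.

```python
import pandas as pd

# Hypothetical toy rows; the second one has a DoB later than its Reg Date.
toy = pd.DataFrame({
    "DoB": pd.to_datetime(["1974-01-05", "1996-06-01"]),
    "Reg Date": pd.to_datetime(["1994-08-13", "1994-08-13"]),
})

# Approximate age in years on the registration date.
age_years = (toy["Reg Date"] - toy["DoB"]).dt.days / 365.25

MIN_AGE = 15  # assumed threshold, for illustration only
suspicious = toy[age_years < MIN_AGE]
print(suspicious.index.tolist())  # [1]
```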


---

- For some campuses, the degree information is missing. Devise some technique to figure it out and update the rows with empty degree fields.

In [759]:
sum(df_std.Degree.isnull())

0

No Degree information is missing.

---

- Validate other business rules for each field. One example is that marks should be in the range 0 to 100 inclusive.

In [760]:
df_std.describe(include='all')

Unnamed: 0,SID,Name,Father,DoB,Gender,Reg Date,Reg Status,Degree Status,Address,Qualification,Degree,Campus,Std_First_Name,Std_Last_Name,Father_First_Name,Father_Last_Name,City
count,17102,17102,17102,17102,17102,17102,17102,17102,17102,17102,17102,17102,17102,17102,17102,17102,17102
unique,17102,14642,14230,4085,2,442,2,2,15360,12,2,3,5879,315,6458,316,41
top,PEW_BS_1573,Kishwar Pasha,Amjad Satti,1982-03-28 00:00:00,M,2003-08-16 00:00:00,A,C,house no.141 S# 16 Haider road Multan,F.Sc.,BS,KHR,Firdous,Zaidi,Amjad,Zaidi,Karachi
freq,1,6,5,14,11531,76,17100,12500,3,2390,14302,8201,55,369,62,369,4438
first,,,,1974-01-01 00:00:00,,1994-01-08 00:00:00,,,,,,,,,,,
last,,,,1986-12-29 00:00:00,,2005-01-12 00:00:00,,,,,,,,,,,


In [761]:
df_reg.Score.describe(include='all')

count    406200.000000
mean         74.460268
std          14.415844
min          50.000000
25%          62.000000
50%          74.000000
75%          87.000000
max          99.000000
Name: Score, dtype: float64
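
The summary shows a minimum of 50 and a maximum of 99, so every score lies within the 0 to 100 range. An explicit check of the rule could look like the following sketch, where `scores` is a small hypothetical stand-in for `df_reg.Score`.

```python
import pandas as pd

# Hypothetical stand-in for df_reg.Score.
scores = pd.Series([50.0, 62.0, 74.0, 87.0, 99.0], name="Score")

# Rows violating the rule 0 <= Score <= 100 (both bounds inclusive).
out_of_range = scores[~scores.between(0, 100)]
print(len(out_of_range))  # 0
```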

# Save Cleaned Data in DB

In [762]:
from sqlalchemy import create_engine
from sqlalchemy.types import Integer, Text, String, DateTime, Boolean, CHAR

In [763]:
df_reg.columns

Index(['SID', 'Course', 'Score', 'Semester', 'Year', 'Discipline', 'Degree',
       'Campus'],
      dtype='object')

In [764]:
std_col_types = {
    "SID": Text,
    "Name": Text,
    "Father": Text,
    "DoB":  DateTime,
    "Gender": CHAR(1),
    "Reg Date": DateTime,
    "Reg Status": CHAR(1),
    "Degree Status": CHAR(1),
    "Address": Text,
    "Qualification": String(8),
    "Degree": CHAR(2),
    "Campus": CHAR(3),  # campus codes such as KHR, LHR and PEW are three characters
    "Std_First_Name": Text,
    "Std_Last_Name": Text,
    "Father_First_Name": Text,
    "Father_Last_Name": Text,
    "City": Text
    }

In [765]:
reg_col_types = {
    "SID": Text,
    "Course": String(8),  # key must match the column name in df_reg.columns
    "Score": Integer,
    "Semester":  String(10),
    "Year": DateTime,
    "Discipline": String(3),
    "Degree": CHAR(2),
    "Campus": CHAR(3)  # campus codes are three characters
    }
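
The keys of a `dtype` mapping are column names, so a misspelled key would not apply to the intended column. A quick check before writing to the database could be sketched as follows. The helper `unknown_dtype_keys` and the mapping `example_types` are hypothetical and exist only for illustration.

```python
def unknown_dtype_keys(columns, col_types):
    """Return the keys of col_types that do not match any column name."""
    return sorted(set(col_types) - set(columns))

# Column names as printed by df_reg.columns.
reg_columns = ['SID', 'Course', 'Score', 'Semester', 'Year',
               'Discipline', 'Degree', 'Campus']

# Hypothetical mapping with one misspelled key, for illustration only.
example_types = {"SID": "Text", "Courses": "String(8)", "Score": "Integer"}
print(unknown_dtype_keys(reg_columns, example_types))  # ['Courses']
```

An empty list means every key corresponds to a real column.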

In [766]:
engine = create_engine('sqlite:///University.db', echo=True)
sqlite_connection = engine.connect()

std_table = "CLN_Students"
df_std.to_sql(
    std_table, 
    sqlite_connection, 
    if_exists='replace', 
    index=False, 
    dtype=std_col_types)


reg_table = "CLN_Course_Registrations"
df_reg.to_sql(
    reg_table, 
    sqlite_connection, 
    if_exists='replace', 
    index=False, 
    dtype=reg_col_types)


sqlite_connection.close()

2020-11-16 16:11:08,724 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2020-11-16 16:11:08,725 INFO sqlalchemy.engine.base.Engine ()
2020-11-16 16:11:08,726 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
2020-11-16 16:11:08,727 INFO sqlalchemy.engine.base.Engine ()
2020-11-16 16:11:08,734 INFO sqlalchemy.engine.base.Engine PRAGMA main.table_info("CLN_Students")
2020-11-16 16:11:08,735 INFO sqlalchemy.engine.base.Engine ()
2020-11-16 16:11:08,737 INFO sqlalchemy.engine.base.Engine PRAGMA main.table_info("CLN_Students")
2020-11-16 16:11:08,738 INFO sqlalchemy.engine.base.Engine ()
2020-11-16 16:11:08,740 INFO sqlalchemy.engine.base.Engine SELECT name FROM sqlite_master WHERE type='table' ORDER BY name
2020-11-16 16:11:08,741 INFO sqlalchemy.engine.base.Engine ()
2020-11-16 16:11:08,743 INFO sqlalchemy.engine.base.Engine PRAGMA main.table_xinfo("CLN_Students")
2020-11-16 16:11:08,744 INFO sqlal