## Fixing Student Info

This solution will handle slightly more complex data cleaning issues. 

Next Steps...

In [1]:
import pandas as pd
import numpy as np


In [2]:
df = pd.read_excel('data/address_data.xlsx')
df.head()

Unnamed: 0,Unique ID,Gardian Name,Email,Address,Scholar Name,Scholar Grade,Scholar 1 Enrollment Status
0,7356714015,London Cameron,rasca@outlook.com,2856 Rhonda Lane Denver CO 80207,Atticus Jackson,K,Application Submitted
1,3233233322,Marley Allen,Marley.Allen@gmail.com,1638 E Washington St Denver CO 80219,Carlo Huang,Second,Currently Enrolled
2,4344344433,Yair Harrington,Yair.Harrington@gmail.com,628 n Lafayette st. Denver CO 80219,Mitchell Castillo,Second,Currently Enrolled
3,5467891234,Ashanti Huynh,Ashanti.Huynh@gmail.com,58 S 2nd street Denver CO 80219,Melina Ward,Second,Currently Enrolled
4,7899877799,Ashanti Sellers,Ashanti.Sellers@gmail.com,58 S 2nd Street Denver CO 80219,Braden McClain,Second,Currently Enrolled


## What We Know About This Dataset
We already know that all items in the 'Address' column are in Denver CO and are properly formatted. We also know that all names in the 'Gardian Name' and 'Scholar Name' columns have only a first and last name. Finally, we know that we have limited errors to correct in the 'Scholar Grade' column.

We can add some additional students to make our first solution more reusable.

We will add students that have the following:
1. Multiple names
2. Parents with multiple names
3. Addresses not in Denver CO
4. Addresses with two lines
5. A variety of grades / errors in the 'Scholar Grade' column

In [3]:
## Adding 2 rows of students to data frame that have the variations listed above
df.loc[len(df.index)] = ['9009009000','Amy Jo Smith', 'aj.smith@gmail.com', 
                         '111 E Emerson St Commerce City CO 80603','Alex Smith Jones','Third','Currently Enrolled'] 
df.loc[len(df.index)] = ['9009009001','Malik Mays', 'malik.mays@gmail.com', 
                         '493 First St Unit A Brighton CO 80601','Tonya Mays','4','Currently Enrolled'] 
#display the last 5
df.tail()

Unnamed: 0,Unique ID,Gardian Name,Email,Address,Scholar Name,Scholar Grade,Scholar 1 Enrollment Status
186,5239073231,Bethany Moon,wagnerch@mac.com,2856 Rhonda Lane Denver CO 80208,Kaiden Mayer,2nd,Welcome Call Complete
187,1014068131,Stella Myers,s.myers@randatmail.com,827 s 10 st Denver CO 80219,Marcus Spencer,1st,Welcome Call Complete
188,9546297375,Rowan Jefferson,druschel@outlook.com,618 N. New St. Denver CO 80202,Alana Porter,1st,Welcome Call Complete
189,9009009000,Amy Jo Smith,aj.smith@gmail.com,111 E Emerson St Commerce City CO 80603,Alex Smith Jones,Third,Currently Enrolled
190,9009009001,Malik Mays,malik.mays@gmail.com,493 First St Unit A Brighton CO 80601,Tonya Mays,4,Currently Enrolled


## Make A Copy
I am going to make a copy of the dataframe to make it easier to show how the old code would handle the new rows of data.

In [4]:
## copy df
df_adv = df.copy()
#show the last 5
df_adv.tail()

Unnamed: 0,Unique ID,Gardian Name,Email,Address,Scholar Name,Scholar Grade,Scholar 1 Enrollment Status
186,5239073231,Bethany Moon,wagnerch@mac.com,2856 Rhonda Lane Denver CO 80208,Kaiden Mayer,2nd,Welcome Call Complete
187,1014068131,Stella Myers,s.myers@randatmail.com,827 s 10 st Denver CO 80219,Marcus Spencer,1st,Welcome Call Complete
188,9546297375,Rowan Jefferson,druschel@outlook.com,618 N. New St. Denver CO 80202,Alana Porter,1st,Welcome Call Complete
189,9009009000,Amy Jo Smith,aj.smith@gmail.com,111 E Emerson St Commerce City CO 80603,Alex Smith Jones,Third,Currently Enrolled
190,9009009001,Malik Mays,malik.mays@gmail.com,493 First St Unit A Brighton CO 80601,Tonya Mays,4,Currently Enrolled


## Name Columns

Split the 'Gardian Name' and 'Scholar Name' columns into separate first- and last-name columns for both guardian and scholar.
We will use the old method first on the 'df' dataframe.

In [5]:
def first_name(name):
    result = name.split()[0]
    return result
def last_name(name):
    result = name.split()[-1]
    return result
#add columns and split the names
df['Guardian_First'] = df['Gardian Name'].apply(lambda x: first_name(x))
df['Guardian_last'] = df['Gardian Name'].apply(lambda x: last_name(x))
df['Scholar_First'] = df['Scholar Name'].apply(lambda x: first_name(x))
df['Scholar_Last'] = df['Scholar Name'].apply(lambda x: last_name(x))
#drop original columns
df = df.drop(['Gardian Name','Scholar Name'], axis=1)
#show last 5
df.tail()

Unnamed: 0,Unique ID,Email,Address,Scholar Grade,Scholar 1 Enrollment Status,Guardian_First,Guardian_last,Scholar_First,Scholar_Last
186,5239073231,wagnerch@mac.com,2856 Rhonda Lane Denver CO 80208,2nd,Welcome Call Complete,Bethany,Moon,Kaiden,Mayer
187,1014068131,s.myers@randatmail.com,827 s 10 st Denver CO 80219,1st,Welcome Call Complete,Stella,Myers,Marcus,Spencer
188,9546297375,druschel@outlook.com,618 N. New St. Denver CO 80202,1st,Welcome Call Complete,Rowan,Jefferson,Alana,Porter
189,9009009000,aj.smith@gmail.com,111 E Emerson St Commerce City CO 80603,Third,Currently Enrolled,Amy,Smith,Alex,Jones
190,9009009001,malik.mays@gmail.com,493 First St Unit A Brighton CO 80601,4,Currently Enrolled,Malik,Mays,Tonya,Mays


### results:
As you can see, we now have issues with the name columns: for row 189 we would expect to see 'Amy Jo' for 'Guardian_First' and 'Smith Jones' for 'Scholar_Last'.
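
To make the failure concrete, here is a minimal sketch that applies the same first-token / last-token logic to the names from the two added rows. The helper name `naive_split` is hypothetical and exists only for this illustration.

```python
# Hypothetical helper mirroring first_name() and last_name() above:
# it keeps only the first and last whitespace-separated tokens.
def naive_split(name):
    parts = name.split()
    return parts[0], parts[-1]

for name in ['Amy Jo Smith', 'Alex Smith Jones', 'Malik Mays']:
    first, last = naive_split(name)
    # any token between the first and the last one is silently discarded
    dropped = name.split()[1:-1]
    print(name, '->', (first, last), 'dropped:', dropped)
```

Any name with more than two parts loses its middle token, and the function has no way of knowing whether that token belonged to the first or the last name.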

## A more advanced solution
Let's try to handle these discrepancies in a different manner.

A new function can check to see if the name does not fit our standard format and then prompt the user to choose a first and last name for the name field and then return that value with a comma in between the values.  

The name situation would be much easier to control with a better data entry system, as it is impossible to get this process completely correct.
### Testing the new function below

In [6]:
##new function for name columns
def name_handler(name):
    if len(name.split())==2:
        new_name = ",".join(name.split())
    else: 
        print('\n')
        print('Name parts need identification')
        print(name)
        first_name = input("Enter the First Name: ")
        last_name = input("Enter the last Name: ")
        new_name = (",".join([first_name, last_name]))
    return new_name

test_names =['Jane Doe','John David Doe']
for name in test_names:
    print("original:{}\t\t new:{}".format(name, name_handler(name)))

original:Jane Doe		 new:Jane,Doe


Name parts need identification
John David Doe
Enter the First Name: John David
Enter the last Name: Doe
original:John David Doe		 new:John David,Doe
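
Because the function stops and waits for input on every unusual name, it can help to know in advance how many prompts to expect. The following is a small sketch of such a pre-check; `needs_review` is a hypothetical helper, not part of the original notebook, and it uses the same two-part rule as `name_handler`.

```python
# Hypothetical pre-check: collect the names that name_handler() would
# prompt for, i.e. those that do not split into exactly two parts.
def needs_review(names):
    return [name for name in names if len(name.split()) != 2]

sample_names = ['Jane Doe', 'John David Doe', 'Amy Jo Smith', 'Malik Mays']
flagged = needs_review(sample_names)
print(len(flagged), 'name(s) need manual review:', flagged)
```

The same list could be built from a dataframe column by passing in its values, which gives a quick estimate of the manual effort before any prompt appears.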


### Apply function to the name columns
We will break this down into two steps. First we add the commas to the names; then we update our old functions to create first- and last-name columns for 'Scholar Name' and 'Guardian Name'.

In [7]:
##fix spelling of 'Gardian Name'
df_adv.rename(columns={'Gardian Name':'Guardian Name'}, inplace=True)
df_adv.head()

Unnamed: 0,Unique ID,Guardian Name,Email,Address,Scholar Name,Scholar Grade,Scholar 1 Enrollment Status
0,7356714015,London Cameron,rasca@outlook.com,2856 Rhonda Lane Denver CO 80207,Atticus Jackson,K,Application Submitted
1,3233233322,Marley Allen,Marley.Allen@gmail.com,1638 E Washington St Denver CO 80219,Carlo Huang,Second,Currently Enrolled
2,4344344433,Yair Harrington,Yair.Harrington@gmail.com,628 n Lafayette st. Denver CO 80219,Mitchell Castillo,Second,Currently Enrolled
3,5467891234,Ashanti Huynh,Ashanti.Huynh@gmail.com,58 S 2nd street Denver CO 80219,Melina Ward,Second,Currently Enrolled
4,7899877799,Ashanti Sellers,Ashanti.Sellers@gmail.com,58 S 2nd Street Denver CO 80219,Braden McClain,Second,Currently Enrolled


In [8]:
## fix 'Guardian Name'
df_adv['Guardian Name'] = df_adv['Guardian Name'].apply(lambda x: name_handler(x))
df_adv.tail()



Name parts need identification
Chad Michael Murray
Enter the First Name: Chad Michael
Enter the last Name: Murray


Name parts need identification
Amy Jo Smith
Enter the First Name: Amy Jo
Enter the last Name: Smith


Unnamed: 0,Unique ID,Guardian Name,Email,Address,Scholar Name,Scholar Grade,Scholar 1 Enrollment Status
186,5239073231,"Bethany,Moon",wagnerch@mac.com,2856 Rhonda Lane Denver CO 80208,Kaiden Mayer,2nd,Welcome Call Complete
187,1014068131,"Stella,Myers",s.myers@randatmail.com,827 s 10 st Denver CO 80219,Marcus Spencer,1st,Welcome Call Complete
188,9546297375,"Rowan,Jefferson",druschel@outlook.com,618 N. New St. Denver CO 80202,Alana Porter,1st,Welcome Call Complete
189,9009009000,"Amy Jo,Smith",aj.smith@gmail.com,111 E Emerson St Commerce City CO 80603,Alex Smith Jones,Third,Currently Enrolled
190,9009009001,"Malik,Mays",malik.mays@gmail.com,493 First St Unit A Brighton CO 80601,Tonya Mays,4,Currently Enrolled


In [9]:
## fix 'Scholar Name'
df_adv['Scholar Name'] = df_adv['Scholar Name'].apply(lambda x: name_handler(x))
df_adv.tail()



Name parts need identification
Alex Smith Jones
Enter the First Name: Alex
Enter the last Name: Smith Jones


Unnamed: 0,Unique ID,Guardian Name,Email,Address,Scholar Name,Scholar Grade,Scholar 1 Enrollment Status
186,5239073231,"Bethany,Moon",wagnerch@mac.com,2856 Rhonda Lane Denver CO 80208,"Kaiden,Mayer",2nd,Welcome Call Complete
187,1014068131,"Stella,Myers",s.myers@randatmail.com,827 s 10 st Denver CO 80219,"Marcus,Spencer",1st,Welcome Call Complete
188,9546297375,"Rowan,Jefferson",druschel@outlook.com,618 N. New St. Denver CO 80202,"Alana,Porter",1st,Welcome Call Complete
189,9009009000,"Amy Jo,Smith",aj.smith@gmail.com,111 E Emerson St Commerce City CO 80603,"Alex,Smith Jones",Third,Currently Enrolled
190,9009009001,"Malik,Mays",malik.mays@gmail.com,493 First St Unit A Brighton CO 80601,"Tonya,Mays",4,Currently Enrolled


### results:
We now have comma-separated names, and we found a name with three parts (Chad Michael Murray) that went unnoticed in the spreadsheet solution. Such cases are easy to miss by eye, which is a big reason to use pandas over Excel or Google Sheets.

## Create our First / Last name columns for Guardian and Scholar
We can now use our same functions as before but add a ',' as the point to split the names.  

In [10]:
def first_name_comma(name):
    result = name.split(',')[0]
    return result
def last_name_comma(name):
    result = name.split(',')[-1]
    return result

#add columns and split the names
df_adv['Guardian_First'] = df_adv['Guardian Name'].apply(lambda x: first_name_comma(x))
df_adv['Guardian_last'] = df_adv['Guardian Name'].apply(lambda x: last_name_comma(x))
df_adv['Scholar_First'] = df_adv['Scholar Name'].apply(lambda x: first_name_comma(x))
df_adv['Scholar_Last'] = df_adv['Scholar Name'].apply(lambda x: last_name_comma(x))
#drop original columns
df_adv = df_adv.drop(['Guardian Name','Scholar Name'], axis=1)
#show last 5
df_adv.tail()

Unnamed: 0,Unique ID,Email,Address,Scholar Grade,Scholar 1 Enrollment Status,Guardian_First,Guardian_last,Scholar_First,Scholar_Last
186,5239073231,wagnerch@mac.com,2856 Rhonda Lane Denver CO 80208,2nd,Welcome Call Complete,Bethany,Moon,Kaiden,Mayer
187,1014068131,s.myers@randatmail.com,827 s 10 st Denver CO 80219,1st,Welcome Call Complete,Stella,Myers,Marcus,Spencer
188,9546297375,druschel@outlook.com,618 N. New St. Denver CO 80202,1st,Welcome Call Complete,Rowan,Jefferson,Alana,Porter
189,9009009000,aj.smith@gmail.com,111 E Emerson St Commerce City CO 80603,Third,Currently Enrolled,Amy Jo,Smith,Alex,Smith Jones
190,9009009001,malik.mays@gmail.com,493 First St Unit A Brighton CO 80601,4,Currently Enrolled,Malik,Mays,Tonya,Mays
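
As a possible variation on the two comma helpers, both parts can be returned from a single split. This is only a sketch; `split_comma_name` is a hypothetical name, and it assumes the input has already been through `name_handler`, so it contains one comma.

```python
# Hypothetical alternative to first_name_comma()/last_name_comma():
# split once on the first comma and return both parts together.
def split_comma_name(name):
    first, _, last = name.partition(',')
    return first.strip(), last.strip()

print(split_comma_name('Amy Jo,Smith'))
print(split_comma_name('Alex,Smith Jones'))
```

One difference from the helpers above: if a name contains no comma, this version returns an empty last name, whereas `split(',')[-1]` would return the whole string again.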


## Fix the Address Column

We will try using usaddress and/or scourgify to create the address columns.

The original solution should still work, but we will test it on both dataframes.

In [11]:
import usaddress

In [12]:
# Parse and label an address
address = "58 S 2nd Street Denver CO 80219"
parsed_address = usaddress.parse(address)

# Print the parsed address
print(parsed_address)

[('58', 'AddressNumber'), ('S', 'StreetNamePreDirectional'), ('2nd', 'StreetName'), ('Street', 'StreetNamePostType'), ('Denver', 'PlaceName'), ('CO', 'StateName'), ('80219', 'ZipCode')]


This could work, but let's look at scourgify, which is built on top of usaddress and will do some of the heavy lifting for us.

In [13]:
#scourgify https://github.com/GreenBuildingRegistry/usaddress-scourgify
from scourgify import normalize_address_record

record = normalize_address_record('58 s 2nd Street Denver CO 80219')

record

{'address_line_1': '58 S 2ND ST',
 'address_line_2': None,
 'city': 'DENVER',
 'state': 'CO',
 'postal_code': '80219'}

In [14]:
record['address_line_1']

'58 S 2ND ST'

In [15]:
print(record['address_line_2'])

None


This looks like a good starting point. We can now build a function to convert the address column into its individual parts.

In [16]:
## function 
def address_separator(a):
    record = normalize_address_record(a)
    #check second address line
    add2 = ""
    if record['address_line_2']:
        add2=record['address_line_2']
    #create parts of the address
    street = " ".join([record['address_line_1'],add2])
    city = record['city']
    state = record['state']
    zip = record['postal_code']
    return [street, city, state, zip]

address_separator('58 s 2nd Street Denver CO 80219')

['58 S 2ND ST ', 'DENVER', 'CO', '80219']
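
The street value above ends with a space because an empty second line is still joined in. One way to avoid that, shown here as a sketch, is to join only the lines that are present. The `record` dictionaries below are plain dicts shaped like the scourgify output shown earlier; scourgify itself is not called, and `join_street` and the 'UNIT A' second line are made up for the example.

```python
# Sketch only: the dicts below imitate the shape of the record returned by
# normalize_address_record() shown earlier; the library is not called here.
def join_street(record):
    lines = [record.get('address_line_1'), record.get('address_line_2')]
    # keep only the lines that are present, so no trailing space is added
    return ' '.join(line for line in lines if line)

one_line = {'address_line_1': '58 S 2ND ST', 'address_line_2': None,
            'city': 'DENVER', 'state': 'CO', 'postal_code': '80219'}
two_lines = dict(one_line, address_line_2='UNIT A')  # made-up second line

print(repr(join_street(one_line)))
print(repr(join_street(two_lines)))
```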

The first step will be to temporarily create a column that has a list of the parts.

In [17]:
#apply this function to each row creating address part column
df['add_parts'] = df['Address'].apply(lambda row: address_separator(row))

df.head()

Unnamed: 0,Unique ID,Email,Address,Scholar Grade,Scholar 1 Enrollment Status,Guardian_First,Guardian_last,Scholar_First,Scholar_Last,add_parts
0,7356714015,rasca@outlook.com,2856 Rhonda Lane Denver CO 80207,K,Application Submitted,London,Cameron,Atticus,Jackson,"[2856 RHONDA LN , DENVER, CO, 80207]"
1,3233233322,Marley.Allen@gmail.com,1638 E Washington St Denver CO 80219,Second,Currently Enrolled,Marley,Allen,Carlo,Huang,"[1638 E WASHINGTON ST , DENVER, CO, 80219]"
2,4344344433,Yair.Harrington@gmail.com,628 n Lafayette st. Denver CO 80219,Second,Currently Enrolled,Yair,Harrington,Mitchell,Castillo,"[628 N LAFAYETTE ST , DENVER, CO, 80219]"
3,5467891234,Ashanti.Huynh@gmail.com,58 S 2nd street Denver CO 80219,Second,Currently Enrolled,Ashanti,Huynh,Melina,Ward,"[58 S 2ND ST , DENVER, CO, 80219]"
4,7899877799,Ashanti.Sellers@gmail.com,58 S 2nd Street Denver CO 80219,Second,Currently Enrolled,Ashanti,Sellers,Braden,McClain,"[58 S 2ND ST , DENVER, CO, 80219]"


Next, create the individual columns from add_parts.

In [18]:
# assign values to columns
df['street'] = df['add_parts'].apply(lambda x: x[0])
df['city'] = df['add_parts'].apply(lambda x: x[1])
df['state']= df['add_parts'].apply(lambda x: x[2])
df['zip'] = df['add_parts'].apply(lambda x: x[3])
df.head()

Unnamed: 0,Unique ID,Email,Address,Scholar Grade,Scholar 1 Enrollment Status,Guardian_First,Guardian_last,Scholar_First,Scholar_Last,add_parts,street,city,state,zip
0,7356714015,rasca@outlook.com,2856 Rhonda Lane Denver CO 80207,K,Application Submitted,London,Cameron,Atticus,Jackson,"[2856 RHONDA LN , DENVER, CO, 80207]",2856 RHONDA LN,DENVER,CO,80207
1,3233233322,Marley.Allen@gmail.com,1638 E Washington St Denver CO 80219,Second,Currently Enrolled,Marley,Allen,Carlo,Huang,"[1638 E WASHINGTON ST , DENVER, CO, 80219]",1638 E WASHINGTON ST,DENVER,CO,80219
2,4344344433,Yair.Harrington@gmail.com,628 n Lafayette st. Denver CO 80219,Second,Currently Enrolled,Yair,Harrington,Mitchell,Castillo,"[628 N LAFAYETTE ST , DENVER, CO, 80219]",628 N LAFAYETTE ST,DENVER,CO,80219
3,5467891234,Ashanti.Huynh@gmail.com,58 S 2nd street Denver CO 80219,Second,Currently Enrolled,Ashanti,Huynh,Melina,Ward,"[58 S 2ND ST , DENVER, CO, 80219]",58 S 2ND ST,DENVER,CO,80219
4,7899877799,Ashanti.Sellers@gmail.com,58 S 2nd Street Denver CO 80219,Second,Currently Enrolled,Ashanti,Sellers,Braden,McClain,"[58 S 2ND ST , DENVER, CO, 80219]",58 S 2ND ST,DENVER,CO,80219


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 0 to 190
Data columns (total 14 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Unique ID                    191 non-null    object
 1   Email                        187 non-null    object
 2   Address                      191 non-null    object
 3   Scholar Grade                188 non-null    object
 4   Scholar 1 Enrollment Status  191 non-null    object
 5   Guardian_First               191 non-null    object
 6   Guardian_last                191 non-null    object
 7   Scholar_First                191 non-null    object
 8   Scholar_Last                 191 non-null    object
 9   add_parts                    191 non-null    object
 10  street                       191 non-null    object
 11  city                         191 non-null    object
 12  state                        191 non-null    object
 13  zip                          191 non-null    object

In [20]:
df = df.drop(['Address','add_parts'],axis=1)
df.head()

Unnamed: 0,Unique ID,Email,Scholar Grade,Scholar 1 Enrollment Status,Guardian_First,Guardian_last,Scholar_First,Scholar_Last,street,city,state,zip
0,7356714015,rasca@outlook.com,K,Application Submitted,London,Cameron,Atticus,Jackson,2856 RHONDA LN,DENVER,CO,80207
1,3233233322,Marley.Allen@gmail.com,Second,Currently Enrolled,Marley,Allen,Carlo,Huang,1638 E WASHINGTON ST,DENVER,CO,80219
2,4344344433,Yair.Harrington@gmail.com,Second,Currently Enrolled,Yair,Harrington,Mitchell,Castillo,628 N LAFAYETTE ST,DENVER,CO,80219
3,5467891234,Ashanti.Huynh@gmail.com,Second,Currently Enrolled,Ashanti,Huynh,Melina,Ward,58 S 2ND ST,DENVER,CO,80219
4,7899877799,Ashanti.Sellers@gmail.com,Second,Currently Enrolled,Ashanti,Sellers,Braden,McClain,58 S 2ND ST,DENVER,CO,80219


## Repeat and Test with df_adv
We will see whether our new addresses affect this process.

In [21]:
#apply this function to each row creating address part column
df_adv['add_parts'] = df_adv['Address'].apply(lambda row: address_separator(row))

# assign values to columns
df_adv['street'] = df_adv['add_parts'].apply(lambda x: x[0])
df_adv['city'] = df_adv['add_parts'].apply(lambda x: x[1])
df_adv['state']= df_adv['add_parts'].apply(lambda x: x[2])
df_adv['zip'] = df_adv['add_parts'].apply(lambda x: x[3])

#drop columns
df_adv = df_adv.drop(['Address','add_parts'],axis=1)
#show last 5 of df_adv, the dataframe processed in this cell
df_adv.tail()

Unnamed: 0,Unique ID,Email,Scholar Grade,Scholar 1 Enrollment Status,Guardian_First,Guardian_last,Scholar_First,Scholar_Last,street,city,state,zip
186,5239073231,wagnerch@mac.com,2nd,Welcome Call Complete,Bethany,Moon,Kaiden,Mayer,2856 RHONDA LN,DENVER,CO,80208
187,1014068131,s.myers@randatmail.com,1st,Welcome Call Complete,Stella,Myers,Marcus,Spencer,827 S 10TH ST,DENVER,CO,80219
188,9546297375,druschel@outlook.com,1st,Welcome Call Complete,Rowan,Jefferson,Alana,Porter,618 N NEW ST,DENVER,CO,80202
189,9009009000,aj.smith@gmail.com,Third,Currently Enrolled,Amy Jo,Smith,Alex,Smith Jones,111 E EMERSON ST,COMMERCE CITY,CO,80603
190,9009009001,malik.mays@gmail.com,4,Currently Enrolled,Malik,Mays,Tonya,Mays,493 FIRST ST UNIT A,BRIGHTON,CO,80601


### results:  
It looks like the addresses worked without needing too many adjustments. We only needed to tweak the street-address step to use join and add a space between the two address lines.

## Fix Scholar Grade

The goal for this column is to get uniform grade labels.  The desired options are 
- Kindergarten
- 1st 
- 2nd
- PreK 3

In [22]:
df.head()

Unnamed: 0,Unique ID,Email,Scholar Grade,Scholar 1 Enrollment Status,Guardian_First,Guardian_last,Scholar_First,Scholar_Last,street,city,state,zip
0,7356714015,rasca@outlook.com,K,Application Submitted,London,Cameron,Atticus,Jackson,2856 RHONDA LN,DENVER,CO,80207
1,3233233322,Marley.Allen@gmail.com,Second,Currently Enrolled,Marley,Allen,Carlo,Huang,1638 E WASHINGTON ST,DENVER,CO,80219
2,4344344433,Yair.Harrington@gmail.com,Second,Currently Enrolled,Yair,Harrington,Mitchell,Castillo,628 N LAFAYETTE ST,DENVER,CO,80219
3,5467891234,Ashanti.Huynh@gmail.com,Second,Currently Enrolled,Ashanti,Huynh,Melina,Ward,58 S 2ND ST,DENVER,CO,80219
4,7899877799,Ashanti.Sellers@gmail.com,Second,Currently Enrolled,Ashanti,Sellers,Braden,McClain,58 S 2ND ST,DENVER,CO,80219


In [23]:
#check the unique_values:
un_list = df['Scholar Grade'].unique().tolist()
un_list

['K',
 'Second',
 '1st',
 '2nd',
 'Kindergarten',
 'First',
 'PreK 3',
 nan,
 'Third',
 '4']

In [24]:
#create a dictionary matching each unique value to the desired string 
#these are only the ones that we need to change
grade_dict = {'K':'Kindergarten','Second':'2nd','First':'1st'}
grade_dict

{'K': 'Kindergarten', 'Second': '2nd', 'First': '1st'}

In [25]:
# creating a new column for demonstrative purposes.  
df['Scholar_Grade'] = df['Scholar Grade'].map(grade_dict).fillna(df['Scholar Grade'])
df.head()

Unnamed: 0,Unique ID,Email,Scholar Grade,Scholar 1 Enrollment Status,Guardian_First,Guardian_last,Scholar_First,Scholar_Last,street,city,state,zip,Scholar_Grade
0,7356714015,rasca@outlook.com,K,Application Submitted,London,Cameron,Atticus,Jackson,2856 RHONDA LN,DENVER,CO,80207,Kindergarten
1,3233233322,Marley.Allen@gmail.com,Second,Currently Enrolled,Marley,Allen,Carlo,Huang,1638 E WASHINGTON ST,DENVER,CO,80219,2nd
2,4344344433,Yair.Harrington@gmail.com,Second,Currently Enrolled,Yair,Harrington,Mitchell,Castillo,628 N LAFAYETTE ST,DENVER,CO,80219,2nd
3,5467891234,Ashanti.Huynh@gmail.com,Second,Currently Enrolled,Ashanti,Huynh,Melina,Ward,58 S 2ND ST,DENVER,CO,80219,2nd
4,7899877799,Ashanti.Sellers@gmail.com,Second,Currently Enrolled,Ashanti,Sellers,Braden,McClain,58 S 2ND ST,DENVER,CO,80219,2nd
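
The `.map(grade_dict).fillna(...)` pattern used above can be read as a dictionary lookup with a fallback to the original value. Here is a plain-Python analogy, offered as an illustration rather than as what pandas does internally; `relabel` is a hypothetical helper.

```python
# Plain-Python analogy for .map(grade_dict).fillna(original):
# look the value up, and fall back to the value itself when there is no match.
grade_dict = {'K': 'Kindergarten', 'Second': '2nd', 'First': '1st'}

def relabel(grade, mapping):
    return mapping.get(grade, grade)

for grade in ['K', 'Second', '1st', 'Third', '4']:
    print(grade, '->', relabel(grade, grade_dict))
```

Values with no entry in the dictionary, such as 'Third' and '4', pass through unchanged, so they remain in the column until they are handled separately.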


### Drop Column for Scholar Grade
We can now drop the original 'Scholar Grade' column.

In [26]:
df = df.drop(['Scholar Grade'],axis=1)
df.head()

Unnamed: 0,Unique ID,Email,Scholar 1 Enrollment Status,Guardian_First,Guardian_last,Scholar_First,Scholar_Last,street,city,state,zip,Scholar_Grade
0,7356714015,rasca@outlook.com,Application Submitted,London,Cameron,Atticus,Jackson,2856 RHONDA LN,DENVER,CO,80207,Kindergarten
1,3233233322,Marley.Allen@gmail.com,Currently Enrolled,Marley,Allen,Carlo,Huang,1638 E WASHINGTON ST,DENVER,CO,80219,2nd
2,4344344433,Yair.Harrington@gmail.com,Currently Enrolled,Yair,Harrington,Mitchell,Castillo,628 N LAFAYETTE ST,DENVER,CO,80219,2nd
3,5467891234,Ashanti.Huynh@gmail.com,Currently Enrolled,Ashanti,Huynh,Melina,Ward,58 S 2ND ST,DENVER,CO,80219,2nd
4,7899877799,Ashanti.Sellers@gmail.com,Currently Enrolled,Ashanti,Sellers,Braden,McClain,58 S 2ND ST,DENVER,CO,80219,2nd


## Reorder Columns
We will reorder the columns back to their original spots

In [27]:
cols = df.columns.tolist()
cols

['Unique ID',
 'Email',
 'Scholar 1 Enrollment Status',
 'Guardian_First',
 'Guardian_last',
 'Scholar_First',
 'Scholar_Last',
 'street',
 'city',
 'state',
 'zip',
 'Scholar_Grade']

In [28]:
#new column order
correct_order = ['Unique ID','Guardian_First','Guardian_last','Email','street','city','state','zip',
                'Scholar_First','Scholar_Last','Scholar_Grade','Scholar 1 Enrollment Status']
reordered_df = df[correct_order].copy()

In [29]:
#first 5 results
reordered_df.head()

Unnamed: 0,Unique ID,Guardian_First,Guardian_last,Email,street,city,state,zip,Scholar_First,Scholar_Last,Scholar_Grade,Scholar 1 Enrollment Status
0,7356714015,London,Cameron,rasca@outlook.com,2856 RHONDA LN,DENVER,CO,80207,Atticus,Jackson,Kindergarten,Application Submitted
1,3233233322,Marley,Allen,Marley.Allen@gmail.com,1638 E WASHINGTON ST,DENVER,CO,80219,Carlo,Huang,2nd,Currently Enrolled
2,4344344433,Yair,Harrington,Yair.Harrington@gmail.com,628 N LAFAYETTE ST,DENVER,CO,80219,Mitchell,Castillo,2nd,Currently Enrolled
3,5467891234,Ashanti,Huynh,Ashanti.Huynh@gmail.com,58 S 2ND ST,DENVER,CO,80219,Melina,Ward,2nd,Currently Enrolled
4,7899877799,Ashanti,Sellers,Ashanti.Sellers@gmail.com,58 S 2ND ST,DENVER,CO,80219,Braden,McClain,2nd,Currently Enrolled


In [32]:
reordered_df.tail()

Unnamed: 0,Unique ID,Guardian_First,Guardian_last,Email,street,city,state,zip,Scholar_First,Scholar_Last,Scholar_Grade,Scholar 1 Enrollment Status
186,5239073231,Bethany,Moon,wagnerch@mac.com,2856 RHONDA LN,DENVER,CO,80208,Kaiden,Mayer,2nd,Welcome Call Complete
187,1014068131,Stella,Myers,s.myers@randatmail.com,827 S 10TH ST,DENVER,CO,80219,Marcus,Spencer,1st,Welcome Call Complete
188,9546297375,Rowan,Jefferson,druschel@outlook.com,618 N NEW ST,DENVER,CO,80202,Alana,Porter,1st,Welcome Call Complete
189,9009009000,Amy,Smith,aj.smith@gmail.com,111 E EMERSON ST,COMMERCE CITY,CO,80603,Alex,Jones,Third,Currently Enrolled
190,9009009001,Malik,Mays,malik.mays@gmail.com,493 FIRST ST UNIT A,BRIGHTON,CO,80601,Tonya,Mays,4,Currently Enrolled


In [33]:
reordered_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 0 to 190
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Unique ID                    191 non-null    object
 1   Guardian_First               191 non-null    object
 2   Guardian_last                191 non-null    object
 3   Email                        187 non-null    object
 4   street                       191 non-null    object
 5   city                         191 non-null    object
 6   state                        191 non-null    object
 7   zip                          191 non-null    object
 8   Scholar_First                191 non-null    object
 9   Scholar_Last                 191 non-null    object
 10  Scholar_Grade                188 non-null    object
 11  Scholar 1 Enrollment Status  191 non-null    object
dtypes: object(12)
memory usage: 19.4+ KB


### results: 
The code ran without errors, but 'Scholar_Grade' still contains values outside the desired labels, such as 'Third' and '4' in the two added rows.

### More Advanced Solution 
We will use our df_adv and try a different approach that will be more usable. 

We will start with developing a list of acceptable entries for 'Scholar Grade' and freq_error_dict to make common corrections. Then we can use user input to enter grades.


In [35]:
#check df_adv
df_adv.tail(3)

Unnamed: 0,Unique ID,Email,Scholar Grade,Scholar 1 Enrollment Status,Guardian_First,Guardian_last,Scholar_First,Scholar_Last,street,city,state,zip
188,9546297375,druschel@outlook.com,1st,Welcome Call Complete,Rowan,Jefferson,Alana,Porter,618 N NEW ST,DENVER,CO,80202
189,9009009000,aj.smith@gmail.com,Third,Currently Enrolled,Amy Jo,Smith,Alex,Smith Jones,111 E EMERSON ST,COMMERCE CITY,CO,80603
190,9009009001,malik.mays@gmail.com,4,Currently Enrolled,Malik,Mays,Tonya,Mays,493 FIRST ST UNIT A,BRIGHTON,CO,80601


## List and Dictionary Creation

In [43]:
#list of acceptable entries
allowed_grades = ['PreK 3','Kindergarten','1st','2nd','3rd','4th',
                  '5th','6th','7th','8th','9th','10th','11th','12th','unknown']
#common errors dictionary
freq_error_dict = {'K':'Kindergarten','First':'1st','Second':'2nd',
                  'Third':'3rd','Fourth':'4th','Fifth':'5th',
                  'Sixth':'6th','Seventh':'7th','Eighth':'8th',
                  'Ninth':'9th','Tenth':'10th','Eleventh':'11th',
                  'Twelfth':'12th'}
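
The dictionary covers spelled-out grades, but a bare number such as '4' is not one of its keys. One possible complement, which is not part of the original notebook, is a small helper that turns digit-only entries into ordinals. `digit_to_ordinal` is a hypothetical name, and the sketch assumes that grades are entered as whole numbers.

```python
# Hypothetical helper: turn a bare grade number such as '4' into '4th'.
def digit_to_ordinal(grade):
    text = str(grade).strip()
    if not text.isdecimal():
        return grade  # leave everything else for the later steps
    number = int(text)
    # 11, 12 and 13 take 'th' even though they end in 1, 2 and 3
    if 10 <= number % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(number % 10, 'th')
    return '{}{}'.format(number, suffix)

for grade in ['4', '1', '12', 'Third']:
    print(grade, '->', digit_to_ordinal(grade))
```

Applied before the interactive step, a helper like this would reduce the number of prompts, at the cost of trusting that a bare number really is a grade.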

## Handle Frequent Errors First
We will use the dictionary to handle the frequent errors first.

We will still create a new column for demonstrative purposes.

In [44]:
df_adv['Scholar_Grade'] = df_adv['Scholar Grade'].map(freq_error_dict).fillna(df_adv['Scholar Grade'])
df_adv.tail()

Unnamed: 0,Unique ID,Email,Scholar Grade,Scholar 1 Enrollment Status,Guardian_First,Guardian_last,Scholar_First,Scholar_Last,street,city,state,zip,Scholar_Grade
186,5239073231,wagnerch@mac.com,2nd,Welcome Call Complete,Bethany,Moon,Kaiden,Mayer,2856 RHONDA LN,DENVER,CO,80208,2nd
187,1014068131,s.myers@randatmail.com,1st,Welcome Call Complete,Stella,Myers,Marcus,Spencer,827 S 10TH ST,DENVER,CO,80219,1st
188,9546297375,druschel@outlook.com,1st,Welcome Call Complete,Rowan,Jefferson,Alana,Porter,618 N NEW ST,DENVER,CO,80202,1st
189,9009009000,aj.smith@gmail.com,Third,Currently Enrolled,Amy Jo,Smith,Alex,Smith Jones,111 E EMERSON ST,COMMERCE CITY,CO,80603,3rd
190,9009009001,malik.mays@gmail.com,4,Currently Enrolled,Malik,Mays,Tonya,Mays,493 FIRST ST UNIT A,BRIGHTON,CO,80601,4


In [50]:
#check unique values 
adv_unique = df_adv['Scholar_Grade'].unique().tolist()
adv_unique

['Kindergarten', '2nd', '1st', 'PreK 3', nan, '3rd', '4']

In [52]:
## check to see if we have any values not in approved list.
np.setdiff1d(adv_unique,allowed_grades).tolist()

['4', 'nan']
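
Note that the missing value shows up as the string 'nan' in the output above. If it is more convenient to keep missing values separate from genuinely unexpected labels, the same check can be sketched in plain Python. The shortened `allowed` list and the `is_missing` helper are assumptions made for this example.

```python
import math

# Shortened stand-ins for allowed_grades and adv_unique from the cells above.
allowed = ['PreK 3', 'Kindergarten', '1st', '2nd', '3rd', '4th']
observed = ['Kindergarten', '2nd', '1st', 'PreK 3', float('nan'), '3rd', '4']

def is_missing(value):
    # NaN is a float, so check the type before calling math.isnan
    return isinstance(value, float) and math.isnan(value)

# labels that are present but not approved, in their original order
unexpected = [v for v in observed if not is_missing(v) and v not in allowed]
# count of missing entries, reported separately
missing_count = sum(is_missing(v) for v in observed)

print('unexpected labels:', unexpected)
print('missing entries:', missing_count)
```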

## Function to update bad 'Scholar_Grade' values
We have 2 to fix this time so we will create a function to update these values.

In [59]:
## create a function to change the grades to values in the approved list only
def make_approved(grade):
    # values already in the approved list pass through unchanged
    if grade in allowed_grades:
        return grade
    print('\n')
    print('{} is not in the allowed list: '.format(grade))
    print(allowed_grades)
    # keep prompting until the user enters an approved value
    new_grade = input("Please enter an allowed grade: ")
    while new_grade not in allowed_grades:
        new_grade = input("Please enter an allowed grade: ")
    return new_grade
        
        
test_grades=['1st','4','nan']
for grade in test_grades:
    print(make_approved(grade))

1st


4 is not in the allowed list: 
['PreK 3', 'Kindergarten', '1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th', '11th', '12th', 'unknown']
Please enter an allowed grade: 4th
4th


nan is not in the allowed list: 
['PreK 3', 'Kindergarten', '1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th', '11th', '12th', 'unknown']
Please enter an allowed grade: unknown
unknown


## Applying to the dataframe
Now we can apply that function to the dataframe. It should be quick because only two distinct values need fixing this time: the missing grades (three rows) and '4'.

In [60]:
df_adv['Scholar_Grade'] = df_adv['Scholar_Grade'].apply(lambda x: make_approved(x))
#show last 5
df_adv.tail()



nan is not in the allowed list: 
['PreK 3', 'Kindergarten', '1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th', '11th', '12th', 'unknown']
Please enter an allowed grade: unknown


nan is not in the allowed list: 
['PreK 3', 'Kindergarten', '1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th', '11th', '12th', 'unknown']
Please enter an allowed grade: unknown


nan is not in the allowed list: 
['PreK 3', 'Kindergarten', '1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th', '11th', '12th', 'unknown']
Please enter an allowed grade: unknown


4 is not in the allowed list: 
['PreK 3', 'Kindergarten', '1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th', '11th', '12th', 'unknown']
Please enter an allowed grade: 4th


Unnamed: 0,Unique ID,Email,Scholar Grade,Scholar 1 Enrollment Status,Guardian_First,Guardian_last,Scholar_First,Scholar_Last,street,city,state,zip,Scholar_Grade
186,5239073231,wagnerch@mac.com,2nd,Welcome Call Complete,Bethany,Moon,Kaiden,Mayer,2856 RHONDA LN,DENVER,CO,80208,2nd
187,1014068131,s.myers@randatmail.com,1st,Welcome Call Complete,Stella,Myers,Marcus,Spencer,827 S 10TH ST,DENVER,CO,80219,1st
188,9546297375,druschel@outlook.com,1st,Welcome Call Complete,Rowan,Jefferson,Alana,Porter,618 N NEW ST,DENVER,CO,80202,1st
189,9009009000,aj.smith@gmail.com,Third,Currently Enrolled,Amy Jo,Smith,Alex,Smith Jones,111 E EMERSON ST,COMMERCE CITY,CO,80603,3rd
190,9009009001,malik.mays@gmail.com,4,Currently Enrolled,Malik,Mays,Tonya,Mays,493 FIRST ST UNIT A,BRIGHTON,CO,80601,4th


## Drop Columns and Reorder

In [61]:
#drop columns
df_adv = df_adv.drop(['Scholar Grade'],axis=1)
#reorder columns
final_df = df_adv[correct_order].copy()

final_df.tail()

Unnamed: 0,Unique ID,Guardian_First,Guardian_last,Email,street,city,state,zip,Scholar_First,Scholar_Last,Scholar_Grade,Scholar 1 Enrollment Status
186,5239073231,Bethany,Moon,wagnerch@mac.com,2856 RHONDA LN,DENVER,CO,80208,Kaiden,Mayer,2nd,Welcome Call Complete
187,1014068131,Stella,Myers,s.myers@randatmail.com,827 S 10TH ST,DENVER,CO,80219,Marcus,Spencer,1st,Welcome Call Complete
188,9546297375,Rowan,Jefferson,druschel@outlook.com,618 N NEW ST,DENVER,CO,80202,Alana,Porter,1st,Welcome Call Complete
189,9009009000,Amy Jo,Smith,aj.smith@gmail.com,111 E EMERSON ST,COMMERCE CITY,CO,80603,Alex,Smith Jones,3rd,Currently Enrolled
190,9009009001,Malik,Mays,malik.mays@gmail.com,493 FIRST ST UNIT A,BRIGHTON,CO,80601,Tonya,Mays,4th,Currently Enrolled


## Export Files


In [62]:
#export to excel and csv, leaving the index column out of both files
final_df.to_excel('data/exported/ready_to_import.xlsx',index=False)
final_df.to_csv('data/exported/ready_to_import.csv',index=False)

## Notebook Results
We have exported a CSV file and an Excel file, and we have provided a more advanced solution than in the first phase.