In [15]:
import sqlite3
import pandas as pd
import ast
import numpy as np
from datetime import datetime

In [16]:
con = sqlite3.connect('cademycode.db')
cur = con.cursor()

In [17]:
table_list = [a for a in cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
print(table_list)

[('cademycode_students',), ('cademycode_courses',), ('cademycode_student_jobs',)]


In [18]:
students = pd.read_sql_query("SELECT * FROM cademycode_students", con)
career_paths = pd.read_sql_query("SELECT * FROM cademycode_courses", con)
student_jobs = pd.read_sql_query("SELECT * FROM cademycode_student_jobs", con)

In [19]:
print('students: ', len(students))
print('career_paths: ', len(career_paths))
print('student_jobs: ', len(student_jobs))

students:  5000
career_paths:  10
student_jobs:  13
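
The loading steps above can also be written generically, so that every table listed in `sqlite_master` ends up in one dictionary of dataframes. This is only a sketch: the helper name `load_all_tables` and the in-memory demo tables are made up for illustration and are not part of `cademycode.db`.

```python
import sqlite3
import pandas as pd

def load_all_tables(con):
    """Return {table_name: DataFrame} for every table in a SQLite database."""
    names = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type = 'table'", con
    )["name"]
    return {name: pd.read_sql_query(f'SELECT * FROM "{name}"', con) for name in names}

# Tiny in-memory database standing in for the real file (made-up rows).
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE demo_students (uuid INTEGER, name TEXT)")
demo_con.execute("CREATE TABLE demo_courses (career_path_id INTEGER)")
demo_con.executemany("INSERT INTO demo_students VALUES (?, ?)", [(1, "A"), (2, "B")])
demo_con.commit()

tables = load_all_tables(demo_con)
row_counts = {name: len(df) for name, df in tables.items()}
print(row_counts)
```

Keeping the dataframes in a dictionary makes it easy to report row counts for all tables in one pass, at the cost of slightly less readable variable names.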


## Working with the `Students` table

Looking at the head of the table:

In [20]:
students.head()

Unnamed: 0,uuid,name,dob,sex,contact_info,job_id,num_course_taken,current_career_path_id,time_spent_hrs
0,1,Annabelle Avery,1943-07-03,F,"{""mailing_address"": ""303 N Timber Key, Irondal...",7.0,6.0,1.0,4.99
1,2,Micah Rubio,1991-02-07,M,"{""mailing_address"": ""767 Crescent Fair, Shoals...",7.0,5.0,8.0,4.4
2,3,Hosea Dale,1989-12-07,M,"{""mailing_address"": ""P.O. Box 41269, St. Bonav...",7.0,8.0,8.0,6.74
3,4,Mariann Kirk,1988-07-31,F,"{""mailing_address"": ""517 SE Wintergreen Isle, ...",6.0,7.0,9.0,12.31
4,5,Lucio Alexander,1963-08-31,M,"{""mailing_address"": ""18 Cinder Cliff, Doyles b...",7.0,14.0,3.0,5.64


The `contact_info` column holds JSON-formatted strings, which we will unpack later. First, let's look at the column data types:

In [21]:
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   uuid                    5000 non-null   int64 
 1   name                    5000 non-null   object
 2   dob                     5000 non-null   object
 3   sex                     5000 non-null   object
 4   contact_info            5000 non-null   object
 5   job_id                  4995 non-null   object
 6   num_course_taken        4749 non-null   object
 7   current_career_path_id  4529 non-null   object
 8   time_spent_hrs          4529 non-null   object
dtypes: int64(1), object(8)
memory usage: 351.7+ KB


The last four columns (`job_id`, `num_course_taken`, `current_career_path_id`, and `time_spent_hrs`) contain missing data, and every column except `uuid` is stored as `object`. We need `dob` in datetime format so that we can compute each student's age for better grouping. Numerical columns should be either integers or floats. We can also treat `job_id` and `current_career_path_id` as categorical data.

In [22]:
students[students.isnull().any(axis=1)]

Unnamed: 0,uuid,name,dob,sex,contact_info,job_id,num_course_taken,current_career_path_id,time_spent_hrs
15,16,Norene Dalton,1976-04-30,F,"{""mailing_address"": ""130 Wishing Essex, Branch...",6.0,0.0,,
19,20,Sofia van Steenbergen,1990-02-21,N,"{""mailing_address"": ""634 Clear Barn Dell, Beam...",7.0,13.0,,
25,26,Doug Browning,1970-06-08,M,"{""mailing_address"": ""P.O. Box 15845, Devine, F...",7.0,,5.0,1.92
26,27,Damon Schrauwen,1953-10-31,M,"{""mailing_address"": ""P.O. Box 84659, Maben, Ge...",4.0,,10.0,3.73
30,31,Christoper Warner,1989-12-28,M,"{""mailing_address"": ""556 Stony Highlands, Drai...",2.0,5.0,,
...,...,...,...,...,...,...,...,...,...
4948,4949,Dewitt van Malsem,1949-03-08,M,"{""mailing_address"": ""423 Course Trail, Wilmot,...",4.0,7.0,,
4956,4957,Todd Stamhuis,1961-06-15,M,"{""mailing_address"": ""251 Grand Rose Underpass,...",7.0,8.0,,
4974,4975,Jorge Creelman,1944-11-24,M,"{""mailing_address"": ""919 Well Overpass, Linden...",2.0,15.0,,
4980,4981,Brice Franklin,1946-12-01,M,"{""mailing_address"": ""947 Panda Way, New Bedfor...",4.0,,5.0,8.66


No obvious pattern stands out. All we can tell is that 707 rows have at least one null value.
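
A quick way to quantify missing data is to count nulls per column and per row, and to check whether two columns tend to be missing together. The sketch below uses a small made-up frame (`demo`), not the real `students` table.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the shape of the missing data.
demo = pd.DataFrame({
    "uuid": [1, 2, 3, 4],
    "job_id": [7.0, np.nan, 4.0, 2.0],
    "num_course_taken": [6.0, 5.0, np.nan, 1.0],
    "time_spent_hrs": [4.99, np.nan, 8.66, 1.2],
})

# Nulls in each column.
nulls_per_column = demo.isnull().sum()

# Rows with at least one null in any column.
rows_with_any_null = int(demo.isnull().any(axis=1).sum())

# Rows where two specific columns are missing at the same time.
both_missing = int((demo["job_id"].isnull() & demo["time_spent_hrs"].isnull()).sum())

print(nulls_per_column)
print(rows_with_any_null, both_missing)
```

Applied to `students`, the same three expressions would show whether the nulls in `current_career_path_id` and `time_spent_hrs` fall on the same rows, which the matching non-null counts hint at.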

### Calculating approximate age

To create an `age` column within our dataframe, we will have to use the `pd.to_datetime` function. We will also create an `age_group` column that should help us better classify our students.

In [31]:
now = datetime.now()
students['age'] = (now - pd.to_datetime(students['dob'])).dt.days // 365
students['age_group'] = (students['age'] // 10)*10
students.head()

Unnamed: 0,uuid,name,dob,sex,contact_info,job_id,num_course_taken,current_career_path_id,time_spent_hrs,age,age_group
0,1,Annabelle Avery,1943-07-03,F,"{""mailing_address"": ""303 N Timber Key, Irondal...",7.0,6.0,1.0,4.99,80,80
1,2,Micah Rubio,1991-02-07,M,"{""mailing_address"": ""767 Crescent Fair, Shoals...",7.0,5.0,8.0,4.4,32,30
2,3,Hosea Dale,1989-12-07,M,"{""mailing_address"": ""P.O. Box 41269, St. Bonav...",7.0,8.0,8.0,6.74,33,30
3,4,Mariann Kirk,1988-07-31,F,"{""mailing_address"": ""517 SE Wintergreen Isle, ...",6.0,7.0,9.0,12.31,35,30
4,5,Lucio Alexander,1963-08-31,M,"{""mailing_address"": ""18 Cinder Cliff, Doyles b...",7.0,14.0,3.0,5.64,60,60
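
Dividing a day count by 365 ignores leap years, so the result can be off by one close to a birthday. If exact ages matter, a calendar-based calculation is one alternative. The sketch below is an assumption-laden example rather than the notebook's method: the helper `age_on` and the fixed reference date are hypothetical, and the fixed date is used so that the result is reproducible.

```python
from datetime import date
import pandas as pd

def age_on(dob: pd.Series, today: date) -> pd.Series:
    """Whole years between each date of birth and `today`, by calendar date."""
    dob = pd.to_datetime(dob)
    # True where this year's birthday has already happened on or before `today`.
    had_birthday = (dob.dt.month < today.month) | (
        (dob.dt.month == today.month) & (dob.dt.day <= today.day)
    )
    return today.year - dob.dt.year - (~had_birthday).astype(int)

demo_dob = pd.Series(["1943-07-03", "1991-02-07", "2000-02-29"])
ages = age_on(demo_dob, date(2023, 7, 2))   # one day before the first birthday
age_groups = (ages // 10) * 10              # same decade bucketing as above
print(ages.tolist(), age_groups.tolist())
```

Passing the reference date explicitly also avoids the ages drifting every time the notebook is re-run, which is what happens with `datetime.now()`.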


### Explode the dictionary

Now we need to extract all the information from the `contact_info` column so that each key in the dictionary gets its own column. First, let's take a look at the data type of that column:

In [32]:
type(students['contact_info'][0])

str

Notice how this column is made up of strings rather than dictionaries. We need to transform this column into dictionaries in order to extract the info with the help of `pd.json_normalize`. Here is how we can implement this:

In [33]:
students['contact_info'] = students['contact_info'].apply(ast.literal_eval)

Now that each entry in `contact_info` is a dictionary, we can flatten the column into a dataframe with `pd.json_normalize`:

In [35]:
explode_contact_info = pd.json_normalize(students['contact_info'])
explode_contact_info.head()

Unnamed: 0,mailing_address,email
0,"303 N Timber Key, Irondale, Wisconsin, 84736",annabelle_avery9376@woohoo.com
1,"767 Crescent Fair, Shoals, Indiana, 37439",rubio6772@hmail.com
2,"P.O. Box 41269, St. Bonaventure, Virginia, 83637",hosea_dale8084@coldmail.com
3,"517 SE Wintergreen Isle, Lane, Arkansas, 82242",kirk4005@hmail.com
4,"18 Cinder Cliff, Doyles borough, Rhode Island,...",alexander9810@hmail.com
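
Since the strings are JSON, `json.loads` is an alternative parser to `ast.literal_eval`. It also handles JSON literals such as `true`, `false`, and `null`, which `ast.literal_eval` rejects. The sketch below runs the parse-then-flatten step on two made-up records; the addresses and emails are placeholders, not rows from the database.

```python
import json
import pandas as pd

# Made-up JSON strings with the same two keys as `contact_info`.
raw = pd.Series([
    '{"mailing_address": "1 Example St, Sampletown, Ohio, 12345", "email": "a@example.com"}',
    '{"mailing_address": "2 Demo Ave, Testville, Utah, 67890", "email": "b@example.com"}',
])

parsed = raw.apply(json.loads)                 # str -> dict
flat = pd.json_normalize(parsed.tolist())      # one column per key

print(flat)
```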


We proceed to concatenate this dataframe with our original `students` dataframe:

In [36]:
students = pd.concat([students.drop('contact_info', axis=1), explode_contact_info], axis=1)
students.head()

Unnamed: 0,uuid,name,dob,sex,job_id,num_course_taken,current_career_path_id,time_spent_hrs,age,age_group,mailing_address,email
0,1,Annabelle Avery,1943-07-03,F,7.0,6.0,1.0,4.99,80,80,"303 N Timber Key, Irondale, Wisconsin, 84736",annabelle_avery9376@woohoo.com
1,2,Micah Rubio,1991-02-07,M,7.0,5.0,8.0,4.4,32,30,"767 Crescent Fair, Shoals, Indiana, 37439",rubio6772@hmail.com
2,3,Hosea Dale,1989-12-07,M,7.0,8.0,8.0,6.74,33,30,"P.O. Box 41269, St. Bonaventure, Virginia, 83637",hosea_dale8084@coldmail.com
3,4,Mariann Kirk,1988-07-31,F,6.0,7.0,9.0,12.31,35,30,"517 SE Wintergreen Isle, Lane, Arkansas, 82242",kirk4005@hmail.com
4,5,Lucio Alexander,1963-08-31,M,7.0,14.0,3.0,5.64,60,60,"18 Cinder Cliff, Doyles borough, Rhode Island,...",alexander9810@hmail.com


We will split `mailing_address` further to extract more information about each student. Notice that `mailing_address` contains the street name, city, state, and zip code. This is valuable information that the analytics team might want to explore.

In [38]:
split_address = students.mailing_address.str.split(',', expand=True)
split_address.columns = ['street', 'city', 'state', 'zip_code']
split_address.head()

Unnamed: 0,street,city,state,zip_code
0,303 N Timber Key,Irondale,Wisconsin,84736
1,767 Crescent Fair,Shoals,Indiana,37439
2,P.O. Box 41269,St. Bonaventure,Virginia,83637
3,517 SE Wintergreen Isle,Lane,Arkansas,82242
4,18 Cinder Cliff,Doyles borough,Rhode Island,73737
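
One detail to watch: splitting on `','` alone keeps the space that follows each comma, so `city`, `state`, and `zip_code` would each start with a blank. A possible follow-up, sketched here on two addresses taken from the output above, is to strip the whitespace from every resulting column.

```python
import pandas as pd

addresses = pd.Series([
    "303 N Timber Key, Irondale, Wisconsin, 84736",
    "P.O. Box 41269, St. Bonaventure, Virginia, 83637",
])

parts = addresses.str.split(",", expand=True)
parts.columns = ["street", "city", "state", "zip_code"]

# Remove the leading space left behind by the ", " separator.
parts = parts.apply(lambda col: col.str.strip())

print(parts)
```

Note that assigning four column names assumes every address has exactly three commas; an address with a different number of commas would make the column assignment fail.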


Now we will add these columns to our `students` dataframe and drop the `mailing_address` column in the process.

In [39]:
students = pd.concat([students.drop('mailing_address', axis=1), split_address], axis=1)
students.head()

Unnamed: 0,uuid,name,dob,sex,job_id,num_course_taken,current_career_path_id,time_spent_hrs,age,age_group,email,street,city,state,zip_code
0,1,Annabelle Avery,1943-07-03,F,7.0,6.0,1.0,4.99,80,80,annabelle_avery9376@woohoo.com,303 N Timber Key,Irondale,Wisconsin,84736
1,2,Micah Rubio,1991-02-07,M,7.0,5.0,8.0,4.4,32,30,rubio6772@hmail.com,767 Crescent Fair,Shoals,Indiana,37439
2,3,Hosea Dale,1989-12-07,M,7.0,8.0,8.0,6.74,33,30,hosea_dale8084@coldmail.com,P.O. Box 41269,St. Bonaventure,Virginia,83637
3,4,Mariann Kirk,1988-07-31,F,6.0,7.0,9.0,12.31,35,30,kirk4005@hmail.com,517 SE Wintergreen Isle,Lane,Arkansas,82242
4,5,Lucio Alexander,1963-08-31,M,7.0,14.0,3.0,5.64,60,60,alexander9810@hmail.com,18 Cinder Cliff,Doyles borough,Rhode Island,73737


We will now take a look at our data types:

In [40]:
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   uuid                    5000 non-null   int64 
 1   name                    5000 non-null   object
 2   dob                     5000 non-null   object
 3   sex                     5000 non-null   object
 4   job_id                  4995 non-null   object
 5   num_course_taken        4749 non-null   object
 6   current_career_path_id  4529 non-null   object
 7   time_spent_hrs          4529 non-null   object
 8   age                     5000 non-null   int64 
 9   age_group               5000 non-null   int64 
 10  email                   5000 non-null   object
 11  street                  5000 non-null   object
 12  city                    5000 non-null   object
 13  state                   5000 non-null   object
 14  zip_code                5000 non-null   object
dtypes: int64(3), object(12)

### Fixing column data types

We want to change the data types of several columns so that we can use them for analysis down the line. `num_course_taken` should hold integer values, and `time_spent_hrs` should be a float. Because `job_id`, `num_course_taken`, and `current_career_path_id` all contain nulls, and a NumPy integer column cannot hold nulls, we cast them to floats instead of integers for the time being.
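
If integer semantics are wanted despite the nulls, one option is pandas' nullable `Int64` dtype. The sketch below is an alternative to float casts, not what this notebook does. The `demo` frame is made up, and it assumes the `object` columns hold numeric strings, which the source does not confirm.

```python
import pandas as pd

# Made-up values; assumed to be numeric strings with some missing entries.
demo = pd.DataFrame({
    "job_id": ["7.0", None, "4.0"],
    "num_course_taken": ["6.0", "5.0", None],
    "time_spent_hrs": ["4.99", None, "8.66"],
})

# Parse to float first, then convert whole-number columns to nullable integers.
for col in ["job_id", "num_course_taken"]:
    demo[col] = pd.to_numeric(demo[col]).astype("Int64")

# Hours stay as ordinary floats.
demo["time_spent_hrs"] = pd.to_numeric(demo["time_spent_hrs"])

print(demo.dtypes)
```

With `Int64`, missing entries appear as `<NA>` and the remaining values stay integers, so IDs print as `7` rather than `7.0`.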

In [42]:
students['job_id'] = students['job_id'].astype(float)
students['current_career_path_id'] = students['current_career_path_id'].astype(float)
students['num_course_taken'] = students['num_course_taken'].astype(float)
students['time_spent_hrs'] = students['time_spent_hrs'].astype(float)
students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   uuid                    5000 non-null   int64  
 1   name                    5000 non-null   object 
 2   dob                     5000 non-null   object 
 3   sex                     5000 non-null   object 
 4   job_id                  4995 non-null   float64
 5   num_course_taken        4749 non-null   float64
 6   current_career_path_id  4529 non-null   float64
 7   time_spent_hrs          4529 non-null   float64
 8   age                     5000 non-null   int64  
 9   age_group               5000 non-null   int64  
 10  email                   5000 non-null   object 
 11  street                  5000 non-null   object 
 12  city                    5000 non-null   object 
 13  state                   5000 non-null   object 
 14  zip_code                5000 non-null   