# Data Cleaning and Summary Statistics

In [None]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd

Load the data of the survey taken by Data Science students.

In [None]:
df = pd.read_csv('data science survey.csv')

We don't like this data set for a variety of reasons:
<ul>
<li>Column headers are too long
<li>Some cell values are too long
<li>Some cell values are yes/no, but we prefer 1/0
</ul>

In [None]:
df.head(1)

## Cleaning

### Let's change some of the column names

In [None]:
df.columns = ['Timestamp',
             'Job',
             'BachTime',
             'Program',
             'ProgSkills',
             'C',
             'CPP',
             'CS',
             'Java',
             'Python',
             'JS',
             'R',
             'SQL',
             'SAS',
             'Excel',
             'Tableau',
             'Regression',
             'Classification',
             'Clustering']

In [None]:
# print top 5 rows
df.head(???)

Now the column names look a lot better!


### Let's remove timestamps

Suppose that we don't need the timestamps. Here is how to remove a column:

In [None]:
# drop column 'Timestamp' from df
# If you set inplace=True, the drop() method deletes rows or columns directly from the original dataframe.
df.???(columns='Timestamp',inplace=???)

In [None]:
df.head(2)
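
As a minimal sketch of the difference between the two modes of <i>drop</i>, the toy frame below (hypothetical data, not the survey) is first copied without a column and then modified in place:

```python
import pandas as pd

# Hypothetical two-column frame used only for illustration
toy = pd.DataFrame({'Timestamp': ['t1', 't2'], 'Job': ['a', 'b']})

# Without inplace, drop() returns a new frame and leaves toy untouched
copy_without = toy.drop(columns='Timestamp')
cols_before = toy.columns.tolist()

# With inplace=True, toy itself is modified and the call returns None
toy.drop(columns='Timestamp', inplace=True)
cols_after = toy.columns.tolist()
```

Re-running an in-place drop on the same column raises a KeyError, because the column is already gone.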

### Replace job with 0 (no job), 0.5 (part time), and 1 (full time)

<p>We want to replace the values of the column "Job", as follows:</p>
<p>
<ul>
<li><i>No, I'm not working at the moment</i> --> 0
<li><i>Yes, I have a part-time job</i> --> 0.5
<li><i>Yes, I have a full-time job</i> --> 1
</ul>
</p>
<p>
We will show three alternative solutions (1, 2, and 3) for this task. They create three columns (Job1, Job2, and Job3). At the end, we will delete the original column Job together with two of the new columns, and rename the remaining one to "Job".
</p>

#### Solution 1 (column Job1)

Create a column 'Job1' through <i>df.loc</i>.

In [None]:
(df['Job'] == 'No, I\'m not working at the moment').head()

In [None]:
# Generate a new column 'Job1' with 0 (no job), 0.5 (part time), and 1 (full time)
df.???[df['Job'] == 'No, I\'m not working at the moment', ???] = 0.0

In [None]:
df.???[df['Job'] == 'Yes, I have a part-time job', ???] = 0.5

In [None]:
df.???[df['Job'] == 'Yes, I have a full-time job', ???] = 1.0

In [None]:
df.head(6)

In [None]:
# Print column Job1's top 10 rows
df.loc[:,'Job1'][:???]

In [None]:
# Print column Job1's top 10 rows
df.Job1[:???]

Or use **.unique()** and **.nunique()** to check the conversion results.

In [None]:
# print unique values in column Job1
df.Job1.???()

In [None]:
# print the number of unique values in column Job1
df.Job1.???()
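
The boolean-mask pattern can be sketched on a small hypothetical column (the labels 'none', 'part', and 'full' are stand-ins for the long survey answers):

```python
import pandas as pd

# Hypothetical short labels standing in for the survey answers
toy = pd.DataFrame({'Job': ['none', 'part', 'full', 'part']})

# The first assignment creates the column; unmatched rows start out as NaN
toy.loc[toy['Job'] == 'none', 'JobNum'] = 0.0
toy.loc[toy['Job'] == 'part', 'JobNum'] = 0.5
toy.loc[toy['Job'] == 'full', 'JobNum'] = 1.0

job_values = toy['JobNum'].tolist()
n_distinct = toy['JobNum'].nunique()
```

Any row whose label matches none of the masks keeps NaN, which is one reason to inspect the result with <i>unique()</i> afterwards.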

#### Solution 2

Here, we will use the function <b>apply</b> on the column <i>Job</i>. The function "apply" requires as input a function that specifies how to transform each value. The function should perform the following transformations:
<ul>
<li>No, I'm not working at the moment => 0</li>
<li>Yes, I have a part-time job => 0.5</li>
<li>Yes, I have a full-time job => 1</li>
</ul>

In [None]:
# Generate a function to replace job string with 0 (no job), 0.5 (part time), and 1 (full time)

def Job2Num(Job_String):
    if Job_String == 'No, I\'m not working at the moment':
        return ???
    elif Job_String == 'Yes, I have a part-time job':
        return ???
    else:
        return ???

In [None]:
# Apply Job2Num function to generate a new column 'Job2' with 0 (no job), 0.5 (part time), and 1 (full time)

df['Job2'] = df['Job'].???(Job2Num)

In [None]:
# print top 6 rows
df.head(???)

### Another version of Job2Num

In [None]:
# Generate a function to replace job string with 0 (no job), 0.5 (part time), and 1 (full time)
# using .startswith function for string comparison

def Job2NumV2(Job_String):
    if Job_String.???('No'):
        return 0
    elif 'part-time' in Job_String:
        return 0.5
    else:
        return 1

#### Solution 3

Instead of declaring a function as above, we can pass a lambda (or anonymous) function:

In [None]:
# Generate a lambda function to replace job string with 0 (no job), 0.5 (part time), and 1 (full time)
df['Job3'] = df['Job'].apply(??? x: 0 if x.startswith('No') \
                            else 0.5 if 'part-time' in x \
                            else 1)

In [None]:
# print top 10 rows
df.head(???)
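
To illustrate that a named function and a lambda give the same result with <i>Series.apply</i>, here is a small sketch on hypothetical strings (shortened versions of the survey answers; the name `to_num` is made up for this example):

```python
import pandas as pd

# Hypothetical, shortened answers
answers = pd.Series(['No job', 'Yes, part-time', 'Yes, full-time'])

def to_num(text):
    """Map an answer string to 0.0, 0.5, or 1.0."""
    if text.startswith('No'):
        return 0.0
    elif 'part-time' in text:
        return 0.5
    else:
        return 1.0

via_def = answers.apply(to_num)
via_lambda = answers.apply(
    lambda s: 0.0 if s.startswith('No') else 0.5 if 'part-time' in s else 1.0
)
```

Note that the final `else` branch catches every string that is neither a "No" nor a part-time answer, so unexpected values would silently become 1.0.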

#### Finalize

The DataFrame now has the original column <i>Job</i> and three identical columns <i>Job1</i>, <i>Job2</i>, and <i>Job3</i>. We delete the original column <i>Job</i>, as well as <i>Job1</i> and <i>Job2</i>, and then rename the remaining column from <i>Job3</i> to <i>Job</i>.

In [None]:
# drop columns Job, Job1, and Job2.
df.???(columns=[???], inplace=True)

In [None]:
# rename the column Job3 to Job.
df.???(columns={'Job3':'Job'}, inplace=True)

In [None]:
df.head(6)

### Replace <i>BachTime</i> with a dummy variable

The time from graduation <i>BachTime</i> can have the following values:
<ul>
<li>less than 1 year ago</li>
<li>longer than 1 year ago but less than 3 years ago</li>
<li>longer than 3 years ago but less than 5 years ago</li>
<li>over 5 years ago</li>
</ul>

We want to convert the column BachTime to <i>dummy variables</i>; that is, we want to create four columns ('Bach_0to1', 'Bach_1to3', 'Bach_3to5', 'Bach_5Plus'), of which exactly one is 1 and the others are 0 in each row. These columns indicate how long ago the student earned the bachelor's degree.

The method <i>get_dummies</i> performs precisely this task. It transforms a column whose set of unique values is $V$ into $|V|$ new columns (one for each value $v \in V$).

Let's create two dummy dataframes to try out two different methods.

In [None]:
# use get_dummies to convert categorical variable into dummy/indicator variables.
dumDF = pd.???(df, columns=['BachTime'])
dumDF1 = pd.???(df, columns=['BachTime'])

In [None]:
dumDF is dumDF1

In [None]:
dumDF1.head(3)
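
A small sketch of <i>get_dummies</i> on hypothetical data follows. It passes `dtype=int` explicitly, on the assumption that 0/1 integers are wanted: recent pandas versions (2.0 and later, as far as we know) produce True/False columns by default.

```python
import pandas as pd

# Hypothetical frame with only two of the four BachTime categories
toy = pd.DataFrame({
    'Program': ['MSIS', 'MBA', 'MSIS'],
    'BachTime': ['< 1 year ago', 'over 5 years ago', '< 1 year ago'],
})

# dtype=int requests 0/1 integers instead of booleans
dummies = pd.get_dummies(toy, columns=['BachTime'], dtype=int)

new_columns = dummies.columns.tolist()
first_dummy = dummies['BachTime_< 1 year ago'].tolist()
```

The original column is removed and the new columns are appended at the end, named with the original column name as a prefix.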

The column names generated by the method <i>get_dummies</i> are very wordy because they use the original cell contents. So, let us rename the columns. We can use two different methods:

If you want to see all columns, you can use the pd.set_option command:

In [None]:
pd.set_option('display.max_columns',None)

### Option 1:
<b>We are using dumDF1</b>

We can do it column by column with rename():

In [None]:
dumDF1.???(columns={'BachTime_< 1 year ago':'Bach_0to1'}, inplace=True)

In [None]:
# Or use axis=1 option
# dumDF1.rename({'BachTime_< 1 year ago':'Bach_0to1'},axis=1,inplace=True)

In [None]:
dumDF1.???(columns={'BachTime_longer than 1 year ago but less than 3 years ago':'Bach_1to3'}, inplace=True)

In [None]:
dumDF1.???(columns={'BachTime_longer than 3 years ago but less than 5 years ago':'Bach_3to5'}, inplace=True)

In [None]:
dumDF1.???(columns={'BachTime_over 5 years ago':'Bach_5Plus'}, inplace=True)

In [None]:
dumDF1.head(3)

### Option 2:

<b>We are using dumDF</b>
1. transform the columns into a list
2. modify the list
3. set the new columns vector all at once

In [None]:
# Get the column index of dumDF and convert it to a list (tolist())
newcolNames = dumDF.columns.???()

In [None]:
# change the last four elements in newcolNames list
newcolNames[???:] = ['Bach_0to1', 'Bach_1to3', 'Bach_3to5', 'Bach_5Plus']

In [None]:
# replace the columns index with newcolNames
dumDF.columns = newcolNames

In [None]:
dumDF.head(3)
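
Both renaming approaches can be compared side by side on a hypothetical three-column frame. As a variation on the column-by-column calls, this sketch passes several old/new pairs to <i>rename</i> in a single dictionary:

```python
import pandas as pd

# Hypothetical frame with two wordy dummy-column names
toy = pd.DataFrame({
    'Program': ['MSIS'],
    'BachTime_< 1 year ago': [1],
    'BachTime_over 5 years ago': [0],
})

# Approach 1: rename() with a dict; columns not in the dict are left alone
renamed = toy.rename(columns={
    'BachTime_< 1 year ago': 'Bach_0to1',
    'BachTime_over 5 years ago': 'Bach_5Plus',
})

# Approach 2: edit the list of names, then assign the whole list back
names = toy.columns.tolist()
names[-2:] = ['Bach_0to1', 'Bach_5Plus']
relabeled = toy.copy()
relabeled.columns = names
```

The dictionary form does not depend on column positions, whereas the list form relies on knowing where the columns sit.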

Overwrite our original dataframe df

In [None]:
df = dumDF

In [None]:
df.head(3)

### Replace Yes with 1 and No with 0

Let us replace "Yes" with 1 and "No" with 0 everywhere.

In [None]:
# The replace() function is used to replace values given in to_replace with value.
df.replace(to_replace=???, value=???, inplace=True)

In [None]:
df.replace(to_replace=???, value=???, inplace=True)

In [None]:
df.head(3)
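
As a hedged sketch on hypothetical data, <i>replace</i> also accepts a dictionary, which handles both substitutions in one call (the columns are created with `dtype=object` here purely to keep the toy example independent of the pandas version's default string type):

```python
import pandas as pd

# Hypothetical Yes/No answers; dtype=object is an assumption for portability
toy = pd.DataFrame({'C': ['Yes', 'No', 'Yes'],
                    'Java': ['No', 'No', 'Yes']}, dtype=object)

# Each dictionary key is replaced by its value wherever it appears as a full cell value
cleaned = toy.replace({'Yes': 1, 'No': 0})

c_values = cleaned['C'].tolist()
java_values = cleaned['Java'].tolist()
```

Only cells that equal "Yes" or "No" exactly are changed; longer strings that merely contain those words are left as they are.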

### Adding simple columns

Let's add a column that counts how many programming languages students know. Languages = C+CPP+CS+Java+Python+JS+R+SQL+SAS

In [None]:
df['Languages'] = df.C+df.CPP+df.CS+df.Java+df.Python+df.JS\
                    +df.R+df.SQL+df.SAS

In [None]:
df.head(3)

To rearrange column order:<br/>
<b>reindex(columns=[the columns in the order that you want])</b>

In [None]:
# Check columns
df.???

Let's place <i>Languages</i> first

In [None]:
# Get the df's column label index and convert them to a list
list_of_columns = df.columns.tolist()

In [None]:
# pop the last element out and insert it back to list_of_columns at position 0
list_of_columns.insert(0, list_of_columns.pop(-1))

In [None]:
# Check top 3 elements in list_of_columns
list_of_columns[:3]

In [None]:
# Use reindex function to reorder columns 
df = df.???(columns=list_of_columns)

In [None]:
df.head(3)
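
The add-then-reorder steps can be sketched on a hypothetical frame with just two language columns:

```python
import pandas as pd

# Hypothetical 0/1 language indicators
toy = pd.DataFrame({'C': [1, 0], 'R': [1, 1]})

# Element-wise sum of the indicator columns
toy['Languages'] = toy.C + toy.R

# Move the last column name to the front of the list
cols = toy.columns.tolist()
cols.insert(0, cols.pop(-1))

# reindex returns a new frame with the columns in the requested order
toy = toy.reindex(columns=cols)
```

Column names passed to <i>reindex</i> that do not exist in the frame become new columns filled with NaN, so a typo in the list does not raise an error.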

### Adding complex columns (advanced topic)

Let's add a 0-1 column called "Expert" that is 1 if <i>Languages</i> is 3 or more and 0 otherwise.

First of all, note that you can do it very easily with what you already know:

In [None]:
df.loc[df['Languages']>=3,???]=???

In [None]:
df.loc[df['Languages']<3,???]=???

In [None]:
df.head(10)

Another way to do it

In [None]:
df['Languages'] >= 3

Adding a floating-point number (+ 0.0) to a True/False value converts it to 1.0 or 0.0.

In [None]:
# Convert boolean into 0/1
(df['Languages'] >= 3) + ???

In [None]:
# Add a new column 'Expert1' using above-mentioned technique
df['Expert1'] = (df['Languages'] >= 3) + ???

In [None]:
# Check top10 rows in Expert1 column
df['Expert1'].head(10)
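
A short sketch on hypothetical counts shows the arithmetic trick next to an explicit cast with <i>astype</i>, which states the intent more directly:

```python
import pandas as pd

# Hypothetical language counts
langs = pd.Series([1, 3, 5])

# Arithmetic with a float promotes the booleans to 0.0 / 1.0
flag_add = (langs >= 3) + 0.0

# An explicit cast produces the same values
flag_cast = (langs >= 3).astype(float)
```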

But suppose that the calculation is complicated. First, we need to define a function that, given a row (i.e., a Series that represents a student), returns 1 if the student is an expert (and 0 otherwise). This row has index labels 'Languages', 'Program', 'ProgSkills', etc.

In [None]:
def ExpertFunction(row):
    ??? row['Languages'] >= 3:
        return 1.0
    ???:
        return 0.0

Second, we use the function <b>apply</b> with axis=1, which calls the function once per row (passing each row as a Series) and returns a Series of the results:

In [None]:
# Apply the function row-wise
df['Expert2']=df.apply(ExpertFunction, axis=???)

In [None]:
df.head(3)

Alternatively, instead of defining a function with <i>def</i>, we can use a lambda function, which allows us to define the function "inline"

In [None]:
df['Expert3'] = df.apply(??? row: 1.0 if row['Languages'] >= 3 else 0.0, axis=???)

In [None]:
df.head(3)
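
Row-wise <i>apply</i> is most useful when the rule reads several columns at once. The sketch below uses a made-up rule (at least 3 languages and a skill rating of at least 4), which is an assumption for illustration and not the notebook's definition of an expert:

```python
import pandas as pd

# Hypothetical students
toy = pd.DataFrame({'Languages': [1, 4, 5], 'ProgSkills': [2, 5, 3]})

def is_expert(row):
    """Hypothetical rule combining two columns of the same row."""
    if row['Languages'] >= 3 and row['ProgSkills'] >= 4:
        return 1.0
    return 0.0

# axis=1 passes each row to the function as a Series
toy['Expert'] = toy.apply(is_expert, axis=1)
expert_values = toy['Expert'].tolist()
```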

In [None]:
# drop the 'Expert3', 'Expert2', and 'Expert1' columns
df.???(['Expert3','Expert2','Expert1'],axis=???, inplace=True)

In [None]:
df.head(3)

<b>Warning</b>: <i>DataFrame.apply</i> is slow, especially when called on all rows (i.e., axis=1). If you can, you should use operations among Series and scalars. The performance on the example above can be improved a lot as follows:

In [None]:
%timeit df['Expert'] = df.apply(lambda row : 1.0 if row['Languages'] >= 3 else 0.0, axis=1)

In [None]:
%timeit df['Expert'] = df.Languages.apply(lambda x: 1.0 if x>=3 else 0.0)

### DataFrame.apply vs Series.apply

In pandas 0.20, DataFrame.apply is slow whereas Series.apply is fast. For example, say that we want to add a column <i>ProgramLower</i> with the lower case value of Program.

#### Solution 1: DataFrame.apply (slow! Do not use!)

In [None]:
%timeit df.apply(lambda x: x['Program'].lower(), axis=1)

#### Solution 2: Series.apply (faster because it avoids building a Series for every row)

In [None]:
%timeit df.Program.apply(lambda y: y.lower())
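
For completeness, here is a sketch of the same two computations written without any Python-level function: the <i>.str</i> accessor for the string case and a plain comparison for the Expert flag. The data is hypothetical and the timing benefit is not measured here. The comparison runs as a single array operation, while <i>.str.lower()</i> is mainly a convenience that also tolerates missing values.

```python
import pandas as pd

# Hypothetical frame
toy = pd.DataFrame({'Program': ['MSIS', 'MBA'], 'Languages': [4, 1]})

# Element-by-element Python call
lower_apply = toy.Program.apply(lambda p: p.lower())

# String accessor: same result, no lambda needed
lower_str = toy.Program.str.lower()

# Whole-column comparison followed by a cast to float
expert_flag = (toy.Languages >= 3).astype(float)
```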

### Write it to a file

In [None]:
# output the cleaned dataframe as a new csv file
df.???('cleaned_survey.csv')

### Summary Functions

pandas provides many simple "summary functions" which restructure the data in some useful way. 

### .describe()

In [None]:
df.???()

This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the dtype of the input. The output above only makes sense for numerical data; for string data here's what we get:

In [None]:
# apply describe function to program series
df.???.describe()
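
The type-aware behaviour can be seen by comparing the labels of the summary for a numeric Series and for a string Series (both hypothetical):

```python
import pandas as pd

# Hypothetical numeric and string columns
skills = pd.Series([1, 2, 3, 4])
programs = pd.Series(['MSIS', 'MBA', 'MSIS'])

numeric_summary = skills.describe()
string_summary = programs.describe()

numeric_labels = list(numeric_summary.index)
string_labels = list(string_summary.index)
```

For strings, `top` is the most frequent value and `freq` is the number of times it occurs.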

To see a list of unique values we can use the **unique** function:

In [None]:
# use unique() function to check unique values in Program column
df.Program.???()

In [None]:
# check number of unique values in Program column
???(df.Program.unique())

To see a list of unique values and how often they occur in the dataset, we can use the **value_counts** method:

In [None]:
# Use value_counts function to count unique values in Program column
df.Program.???()

versus .count()

In [None]:
# The count() method returns the number of non-missing values in the column
df.Program.count()
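
The differences among these counting methods show up once a value is missing. The sketch below uses a hypothetical Series with one missing entry:

```python
import pandas as pd

# Hypothetical column with one missing answer
prog = pd.Series(['MSIS', 'MBA', 'MSIS', None])

per_value = prog.value_counts().to_dict()  # occurrences of each non-missing value
non_missing = prog.count()                 # number of non-missing entries
n_unique = prog.nunique()                  # distinct values, missing excluded
n_unique_with_missing = len(prog.unique()) # unique() keeps the missing value
total_rows = len(prog)                     # all entries, missing included
```

So `len(s.unique())` and `s.nunique()` can differ by one when the column contains missing values.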

**.shape** returns a tuple representing the dimensionality of the DataFrame.

In [None]:
df.???

## Problems

How many students know SQL?

In [None]:
df.SQL.???()

What is the average programming skill level of MSIS students? Compare it to that of MBA students.

In [None]:
df[???].ProgSkills.mean()

In [None]:
df[???].ProgSkills.mean()

How many students know classification better than clustering? And how many clustering better than classification?

In [None]:
(???).sum()

In [None]:
(???).sum()

Correlation

In [None]:
df.corr(numeric_only=True)

## Find the strongest correlations

What are the top 10 correlations and the top 10 anti-correlations? 

To answer this question, we need to "stack" the correlation matrix with <i>cor.stack()</i>, which turns it into a Series with a "hierarchical index" (that is, an index with two levels: the row label and the column label).

In [None]:
cor = df.corr(numeric_only=True)

In [None]:
# Apply stack function
cor.???()

Remove the correlations equal to 1 (they are self-correlations); then keep every other entry, since each correlation appears twice (once for each ordering of the pair).

Use list slicing: x[startAt:endBefore:skip]

In [None]:
cor[cor < 1].stack().nlargest(???)

In [None]:
# correlation between Classification and Clustering is the same as correlation between Clustering and Classification
# Remove the duplicated correlations
cor[cor < 1].stack().nlargest(???)[::2]

Do the same for negative correlations

In [None]:
cor[cor < 1].stack().???(20)[::2]
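
The whole stack, filter, and slice pipeline can be traced on a hypothetical three-column frame, small enough to check by eye. The data is chosen so that the three pairwise correlations are all different. The `[::2]` step assumes the two copies of each pair end up next to each other after sorting, and that assumption can fail when different pairs have exactly equal correlations.

```python
import pandas as pd

# Hypothetical numeric data with three distinct pairwise correlations
toy = pd.DataFrame({'a': [1, 2, 3, 4],
                    'b': [2, 4, 6, 9],
                    'c': [1, 3, 2, 5]})
cor = toy.corr()

# Mask the diagonal (self-correlations equal to 1), then stack into a Series
# whose index has two levels: (row label, column label)
stacked = cor[cor < 1].stack()

# 3 distinct pairs appear twice each, so take 6 entries and keep every other one
top_pairs = stacked.nlargest(6)[::2]

# Each remaining entry should be a different unordered pair
distinct_pairs = {frozenset(labels) for labels in top_pairs.index}
```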