# Data Cleaning and Summary Statistics

In [1]:
%autosave 0

In [1]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd

Load the data of the survey taken by Data Science students.

In [2]:
df = pd.read_csv('data science survey.csv')

We don't like this data set for a variety of reasons:
<ul>
<li>Column headers are too long
<li>Some cell values are too long
<li>Some cell values are yes/no, but we prefer 1/0
</ul>

In [3]:
df.head(1)

Unnamed: 0,Timestamp,Do you have a job?,How long ago did you get your Bachelor degree?,What program are enrolled in?,How would you rate your computer programming background?,Have you ever programmed in C?,Have you ever programmed in C++?,Have you ever programmed in C#?,Have you ever programmed in Java?,Have you ever programmed in Python?,Have you ever programmed in Javascript?,Have you ever programmed in R?,Have you ever programmed in SQL?,Have you ever used SAS?,Have you ever used Excel?,Have you ever used Tableau?,Have you ever run a regression?,"How familiar are you with the Machine Learning task called ""classification""?","How familiar are you with the Machine Learning task called ""clustering""?"
0,2017/01/09 2:48:20 PM MST,"No, I'm not working at the moment",longer than 1 year ago but less than 3 years ago,MSIS,4,Yes,Yes,No,Yes,Yes,Yes,No,Yes,No,Yes,No,Yes,4,4


## Cleaning

### Let's change some of the column names

In [5]:
df.columns = ['Timestamp',
             'Job',
             'BachTime',
             'Program',
             'ProgSkills',
             'C',
             'CPP',
             'CS',
             'Java',
             'Python',
             'JS',
             'R',
             'SQL',
             'SAS',
             'Excel',
             'Tableau',
             'Regression',
             'Classification',
             'Clustering']

In [6]:
df.head()

Unnamed: 0,Timestamp,Job,BachTime,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering
0,2017/01/09 2:48:20 PM MST,"No, I'm not working at the moment",longer than 1 year ago but less than 3 years ago,MSIS,4,Yes,Yes,No,Yes,Yes,Yes,No,Yes,No,Yes,No,Yes,4,4
1,2017/01/09 3:15:59 PM MST,"Yes, I have a part-time job",over 5 years ago,MSIS,3,Yes,Yes,No,Yes,No,No,No,Yes,No,Yes,No,No,2,2
2,2017/01/09 4:48:48 PM MST,"No, I'm not working at the moment",longer than 3 years ago but less than 5 years ago,MSIS,3,No,No,No,Yes,Yes,No,No,Yes,No,Yes,No,Yes,3,3
3,2017/01/09 4:48:51 PM MST,"No, I'm not working at the moment",over 5 years ago,MSIS,3,Yes,No,No,Yes,Yes,No,Yes,Yes,No,Yes,No,Yes,2,3
4,2017/01/09 4:50:03 PM MST,"No, I'm not working at the moment",longer than 3 years ago but less than 5 years ago,MSIS,3,Yes,No,No,Yes,Yes,No,No,Yes,No,Yes,No,No,1,1


Now the column names look a lot better!


### Let's remove timestamps

Suppose that we don't need the timestamps. Here is how to remove a column:

In [7]:
df.drop('Timestamp', axis=1, inplace=True)

In [8]:
df.head(2)

Unnamed: 0,Job,BachTime,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering
0,"No, I'm not working at the moment",longer than 1 year ago but less than 3 years ago,MSIS,4,Yes,Yes,No,Yes,Yes,Yes,No,Yes,No,Yes,No,Yes,4,4
1,"Yes, I have a part-time job",over 5 years ago,MSIS,3,Yes,Yes,No,Yes,No,No,No,Yes,No,Yes,No,No,2,2


The argument <i>axis=1</i> is to indicate that we are removing a column, not a row.
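
As a small sketch on a made-up two-column frame (not the survey data), the same removal can also be written with the `columns=` keyword, which avoids the `axis` argument and returns a new DataFrame instead of modifying the original in place. This assumes pandas 0.21 or later, where `drop` accepts `columns=`.

```python
import pandas as pd

# Toy frame standing in for the survey data (values are made up).
toy = pd.DataFrame({'Timestamp': ['t0', 't1'], 'Job': ['a', 'b']})

# Same effect as toy.drop('Timestamp', axis=1), but the original is left untouched.
no_ts = toy.drop(columns=['Timestamp'])

print(no_ts.columns.tolist())   # ['Job']
print(toy.columns.tolist())     # ['Timestamp', 'Job']
```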

### Replace job with 0 (no job), 0.5 (part time), and 1 (full time)

<p>We want to replace the values of the column "Job", as follows:</p>
<p>
<ul>
<li><i>No, I'm not working at the moment</i> --> 0
<li><i>Yes, I have a part-time job</i> --> 0.5
<li><i>Yes, I have a full-time job</i> --> 1
</ul>
</p>
<p>
We will show three alternative solutions (1, 2, and 3) to perform this task. They will result in the creation of three columns (Job1, Job2, and Job3). At the end, we will delete the original column Job, we will delete two of these columns, and we will rename the remaining column "Job".
</p>

#### Solution 1 (column Job1)

Create a column 'Job1' through <i>df.loc</i>.

In [9]:
df.loc[df['Job'] == 'No, I\'m not working at the moment', 'Job1'] = 0

In [10]:
df.loc[df['Job'] == 'Yes, I have a part-time job', 'Job1'] = 0.5

In [11]:
df.loc[df['Job'] == 'Yes, I have a full-time job', 'Job1'] = 1

In [12]:
df.head(6)

Unnamed: 0,Job,BachTime,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job1
0,"No, I'm not working at the moment",longer than 1 year ago but less than 3 years ago,MSIS,4,Yes,Yes,No,Yes,Yes,Yes,No,Yes,No,Yes,No,Yes,4,4,0.0
1,"Yes, I have a part-time job",over 5 years ago,MSIS,3,Yes,Yes,No,Yes,No,No,No,Yes,No,Yes,No,No,2,2,0.5
2,"No, I'm not working at the moment",longer than 3 years ago but less than 5 years ago,MSIS,3,No,No,No,Yes,Yes,No,No,Yes,No,Yes,No,Yes,3,3,0.0
3,"No, I'm not working at the moment",over 5 years ago,MSIS,3,Yes,No,No,Yes,Yes,No,Yes,Yes,No,Yes,No,Yes,2,3,0.0
4,"No, I'm not working at the moment",longer than 3 years ago but less than 5 years ago,MSIS,3,Yes,No,No,Yes,Yes,No,No,Yes,No,Yes,No,No,1,1,0.0
5,"Yes, I have a full-time job",longer than 1 year ago but less than 3 years ago,Supply Chain Mgmt & Analytics,1,No,No,No,No,No,No,No,Yes,No,Yes,No,Yes,1,1,1.0


In [13]:
df.loc[:,'Job1'][:10]

0    0.0
1    0.5
2    0.0
3    0.0
4    0.0
5    1.0
6    0.0
7    0.0
8    1.0
9    0.5
Name: Job1, dtype: float64

In [14]:
df.Job1[:10]

0    0.0
1    0.5
2    0.0
3    0.0
4    0.0
5    1.0
6    0.0
7    0.0
8    1.0
9    0.5
Name: Job1, dtype: float64

#### Solution 2 (column Job2)

Here, we will use the function <b>apply</b> on the column <i>Job</i>. The function "apply" requires as input a function that specifies how to transform each value. The function should perform the following transformations:
<ul>
<li>No, I'm not working at the moment => 0</li>
<li>Yes, I have a part-time job => 0.5</li>
<li>Yes, I have a full-time job => 1</li>
</ul>

In [15]:
def Job2Num(Job_String):
    if Job_String == 'No, I\'m not working at the moment':
        return 0
    elif Job_String == 'Yes, I have a part-time job':
        return 0.5
    else:
        return 1

In [16]:
df['Job2'] = df['Job'].apply(Job2Num)

In [17]:
df.head(6)

Unnamed: 0,Job,BachTime,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job1,Job2
0,"No, I'm not working at the moment",longer than 1 year ago but less than 3 years ago,MSIS,4,Yes,Yes,No,Yes,Yes,Yes,No,Yes,No,Yes,No,Yes,4,4,0.0,0.0
1,"Yes, I have a part-time job",over 5 years ago,MSIS,3,Yes,Yes,No,Yes,No,No,No,Yes,No,Yes,No,No,2,2,0.5,0.5
2,"No, I'm not working at the moment",longer than 3 years ago but less than 5 years ago,MSIS,3,No,No,No,Yes,Yes,No,No,Yes,No,Yes,No,Yes,3,3,0.0,0.0
3,"No, I'm not working at the moment",over 5 years ago,MSIS,3,Yes,No,No,Yes,Yes,No,Yes,Yes,No,Yes,No,Yes,2,3,0.0,0.0
4,"No, I'm not working at the moment",longer than 3 years ago but less than 5 years ago,MSIS,3,Yes,No,No,Yes,Yes,No,No,Yes,No,Yes,No,No,1,1,0.0,0.0
5,"Yes, I have a full-time job",longer than 1 year ago but less than 3 years ago,Supply Chain Mgmt & Analytics,1,No,No,No,No,No,No,No,Yes,No,Yes,No,Yes,1,1,1.0,1.0


#### Another version of Job2Num

In [18]:
def Job2NumV2(Job_String):
    if Job_String.startswith('No'):
        return 0
    elif 'part-time' in Job_String:
        return 0.5
    else:
        return 1

#### Solution 3 (column Job3)

Instead of declaring a function as above, we can pass a lambda (anonymous) function:

In [19]:
df['Job3'] = df['Job'].apply(lambda x: 0 if x.startswith('No') \
                            else 0.5 if 'part-time' in x \
                            else 1)

In [20]:
df.head(10)

Unnamed: 0,Job,BachTime,Program,ProgSkills,C,CPP,CS,Java,Python,JS,...,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job1,Job2,Job3
0,"No, I'm not working at the moment",longer than 1 year ago but less than 3 years ago,MSIS,4,Yes,Yes,No,Yes,Yes,Yes,...,Yes,No,Yes,No,Yes,4,4,0.0,0.0,0.0
1,"Yes, I have a part-time job",over 5 years ago,MSIS,3,Yes,Yes,No,Yes,No,No,...,Yes,No,Yes,No,No,2,2,0.5,0.5,0.5
2,"No, I'm not working at the moment",longer than 3 years ago but less than 5 years ago,MSIS,3,No,No,No,Yes,Yes,No,...,Yes,No,Yes,No,Yes,3,3,0.0,0.0,0.0
3,"No, I'm not working at the moment",over 5 years ago,MSIS,3,Yes,No,No,Yes,Yes,No,...,Yes,No,Yes,No,Yes,2,3,0.0,0.0,0.0
4,"No, I'm not working at the moment",longer than 3 years ago but less than 5 years ago,MSIS,3,Yes,No,No,Yes,Yes,No,...,Yes,No,Yes,No,No,1,1,0.0,0.0,0.0
5,"Yes, I have a full-time job",longer than 1 year ago but less than 3 years ago,Supply Chain Mgmt & Analytics,1,No,No,No,No,No,No,...,Yes,No,Yes,No,Yes,1,1,1.0,1.0,1.0
6,"No, I'm not working at the moment",longer than 1 year ago but less than 3 years ago,MSIS,3,Yes,Yes,No,Yes,No,No,...,Yes,No,Yes,No,No,2,2,0.0,0.0,0.0
7,"No, I'm not working at the moment",< 1 year ago,MSIS,2,Yes,No,No,Yes,No,No,...,Yes,No,Yes,Yes,No,2,2,0.0,0.0,0.0
8,"Yes, I have a full-time job",over 5 years ago,MBA,1,No,No,No,No,No,No,...,No,No,Yes,No,Yes,1,1,1.0,1.0,1.0
9,"Yes, I have a part-time job",over 5 years ago,MSIS,3,Yes,No,No,Yes,Yes,No,...,Yes,No,Yes,Yes,Yes,2,1,0.5,0.5,0.5
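
A further alternative, not used in this notebook, is to look each string up in a dictionary with `Series.map`. The sketch below uses a made-up Series rather than the survey column. One difference from the functions above is that their final `else` branch returns 1 for any unrecognized answer, whereas `map` leaves unrecognized answers as NaN, which may make unexpected values easier to spot.

```python
import pandas as pd

# Made-up answers; the last one is deliberately not among the three survey options.
job = pd.Series(["No, I'm not working at the moment",
                 'Yes, I have a part-time job',
                 'Yes, I have a full-time job',
                 'Something unexpected'])

job_codes = {"No, I'm not working at the moment": 0.0,
             'Yes, I have a part-time job': 0.5,
             'Yes, I have a full-time job': 1.0}

# Values missing from the dictionary become NaN instead of silently becoming 1.
job_num = job.map(job_codes)
print(job_num)
```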


#### Finalize

The DataFrame now has the original column <i>Job</i> and three identical columns <i>Job1</i>, <i>Job2</i>, and <i>Job3</i>. We delete the original column <i>Job</i>, as well as <i>Job1</i> and <i>Job2</i>, and then rename the remaining column from <i>Job3</i> to <i>Job</i>.

In [21]:
df.drop(['Job','Job1','Job2'], axis=1, inplace=True)

In [22]:
df.rename(columns={'Job3':'Job'}, inplace=True)

In [23]:
df.head(6)

Unnamed: 0,BachTime,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job
0,longer than 1 year ago but less than 3 years ago,MSIS,4,Yes,Yes,No,Yes,Yes,Yes,No,Yes,No,Yes,No,Yes,4,4,0.0
1,over 5 years ago,MSIS,3,Yes,Yes,No,Yes,No,No,No,Yes,No,Yes,No,No,2,2,0.5
2,longer than 3 years ago but less than 5 years ago,MSIS,3,No,No,No,Yes,Yes,No,No,Yes,No,Yes,No,Yes,3,3,0.0
3,over 5 years ago,MSIS,3,Yes,No,No,Yes,Yes,No,Yes,Yes,No,Yes,No,Yes,2,3,0.0
4,longer than 3 years ago but less than 5 years ago,MSIS,3,Yes,No,No,Yes,Yes,No,No,Yes,No,Yes,No,No,1,1,0.0
5,longer than 1 year ago but less than 3 years ago,Supply Chain Mgmt & Analytics,1,No,No,No,No,No,No,No,Yes,No,Yes,No,Yes,1,1,1.0


### Replace <i>BachTime</i> with a dummy variable

The time from graduation <i>BachTime</i> can have the following values:
<ul>
<li>&lt; 1 year ago</li>
<li>longer than 1 year ago but less than 3 years ago</li>
<li>longer than 3 years ago but less than 5 years ago</li>
<li>over 5 years ago</li>
</ul>

We want to convert the column BachTime to <i>dummy variables</i>; that is, we want to create four columns ('Bach_0to1', 'Bach_1to3', 'Bach_3to5', 'Bach_5Plus'), exactly one of which is 1 in each row while the others are 0. The column that is 1 indicates how long ago the student received the bachelor's degree.

The function <i>get_dummies</i> performs precisely this task. It transforms a column that contains $V$ unique values into $V$ new columns (one for each unique value).
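
To make the transformation concrete, here is a minimal sketch on a made-up three-row frame with only two distinct BachTime values. The `dtype=int` argument is an assumption about the pandas version (it exists from 0.23 on); it is passed only so that the dummy columns hold 0/1 integers rather than booleans, which is what some newer versions produce by default.

```python
import pandas as pd

# Made-up data: two distinct values in BachTime, so two dummy columns are created.
toy = pd.DataFrame({'Name': ['a', 'b', 'c'],
                    'BachTime': ['< 1 year ago', 'over 5 years ago', '< 1 year ago']})

dummies = pd.get_dummies(toy, columns=['BachTime'], dtype=int)

# The original column is replaced by one column per unique value,
# named <original column>_<value>.
print(dummies.columns.tolist())
print(dummies)
```

In every row exactly one of the new columns is 1, so the dummy columns of a row always add up to 1.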

Let's create two dummy dataframes to try out two different methods.

In [24]:
dumDF = pd.get_dummies(df, columns=['BachTime'])
dumDF1 = pd.get_dummies(df, columns=['BachTime'])

In [25]:
dumDF is dumDF1

False

In [26]:
dumDF1.head(3)

Unnamed: 0,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,...,Excel,Tableau,Regression,Classification,Clustering,Job,BachTime_< 1 year ago,BachTime_longer than 1 year ago but less than 3 years ago,BachTime_longer than 3 years ago but less than 5 years ago,BachTime_over 5 years ago
0,MSIS,4,Yes,Yes,No,Yes,Yes,Yes,No,Yes,...,Yes,No,Yes,4,4,0.0,0,1,0,0
1,MSIS,3,Yes,Yes,No,Yes,No,No,No,Yes,...,Yes,No,No,2,2,0.5,0,0,0,1
2,MSIS,3,No,No,No,Yes,Yes,No,No,Yes,...,Yes,No,Yes,3,3,0.0,0,0,1,0


The column names generated by <i>get_dummies</i> are very wordy because they use the original cell contents. So, let us rename the columns. We can use two different methods.

First, to see all the columns in the output, we can raise the display limit with pd.set_option:

In [27]:
pd.set_option('display.max_columns',100)

### Option 1:
<b>We are using dumDF1</b>

We can do it column by column with rename():

In [28]:
dumDF1.rename({'BachTime_< 1 year ago' \
               : 'Bach_0to1'},axis=1,inplace=True)

In [29]:
dumDF1.rename({'BachTime_longer than 1 year ago but less than 3 years ago'\
               : 'Bach_1to3'},axis=1,inplace=True)

In [30]:
dumDF1.rename({'BachTime_longer than 3 years ago but less than 5 years ago' \
               : 'Bach_3to5'},axis=1,inplace=True)

In [31]:
dumDF1.rename({'BachTime_over 5 years ago' : \
               'Bach_5Plus'},axis=1,inplace=True)

In [32]:
dumDF1.head(3)

Unnamed: 0,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus
0,MSIS,4,Yes,Yes,No,Yes,Yes,Yes,No,Yes,No,Yes,No,Yes,4,4,0.0,0,1,0,0
1,MSIS,3,Yes,Yes,No,Yes,No,No,No,Yes,No,Yes,No,No,2,2,0.5,0,0,0,1
2,MSIS,3,No,No,No,Yes,Yes,No,No,Yes,No,Yes,No,Yes,3,3,0.0,0,0,1,0
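
The four separate calls can also be collapsed into a single `rename` with one dictionary. The sketch below runs on a made-up one-row frame that has only the four dummy columns; the dictionary name `short_names` is introduced here for illustration.

```python
import pandas as pd

# Made-up frame with only the four wordy dummy columns.
toy = pd.DataFrame([[0, 1, 0, 0]], columns=[
    'BachTime_< 1 year ago',
    'BachTime_longer than 1 year ago but less than 3 years ago',
    'BachTime_longer than 3 years ago but less than 5 years ago',
    'BachTime_over 5 years ago'])

# One dictionary: old name -> new name.
short_names = {
    'BachTime_< 1 year ago': 'Bach_0to1',
    'BachTime_longer than 1 year ago but less than 3 years ago': 'Bach_1to3',
    'BachTime_longer than 3 years ago but less than 5 years ago': 'Bach_3to5',
    'BachTime_over 5 years ago': 'Bach_5Plus',
}

renamed = toy.rename(columns=short_names)
print(renamed.columns.tolist())
```

Because the dictionary is keyed by the old names, the result does not depend on the order in which the columns appear.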


### Option 2:

<b>We are using dumDF</b>
1. transform the columns into a list
2. modify the list
3. set the new columns vector all at once

In [33]:
newcolNames = dumDF.columns.tolist()

In [34]:
newcolNames[-4:] = ['Bach_0to1', 'Bach_1to3', 'Bach_3to5', 'Bach_5Plus']

In [35]:
dumDF.columns = newcolNames

In [36]:
dumDF.head(3)

Unnamed: 0,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus
0,MSIS,4,Yes,Yes,No,Yes,Yes,Yes,No,Yes,No,Yes,No,Yes,4,4,0.0,0,1,0,0
1,MSIS,3,Yes,Yes,No,Yes,No,No,No,Yes,No,Yes,No,No,2,2,0.5,0,0,0,1
2,MSIS,3,No,No,No,Yes,Yes,No,No,Yes,No,Yes,No,Yes,3,3,0.0,0,0,1,0


Overwrite our original DataFrame df:

In [37]:
df = dumDF

In [38]:
df.head(3)

Unnamed: 0,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus
0,MSIS,4,Yes,Yes,No,Yes,Yes,Yes,No,Yes,No,Yes,No,Yes,4,4,0.0,0,1,0,0
1,MSIS,3,Yes,Yes,No,Yes,No,No,No,Yes,No,Yes,No,No,2,2,0.5,0,0,0,1
2,MSIS,3,No,No,No,Yes,Yes,No,No,Yes,No,Yes,No,Yes,3,3,0.0,0,0,1,0


### Replace Yes with 1 and No with 0

Let us replace "Yes" with 1 and "No" with 0 everywhere:

In [39]:
df.replace(to_replace='Yes', value=1.0, inplace=True)

In [40]:
df.replace(to_replace='No', value=0.0, inplace=True)

In [41]:
df.head(3)

Unnamed: 0,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus
0,MSIS,4,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,4,4,0.0,0,1,0,0
1,MSIS,3,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,2,2,0.5,0,0,0,1
2,MSIS,3,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,3,3,0.0,0,0,1,0
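
Replacing across the whole DataFrame also touches any other column that happens to contain the exact cell value "Yes" or "No". A more targeted sketch, on made-up data with a hypothetical free-text column `Comment`, restricts the conversion to a chosen list of columns and uses `Series.map` with a dictionary:

```python
import pandas as pd

# Made-up data; 'Comment' is a hypothetical free-text column that must stay untouched.
toy = pd.DataFrame({'Comment': ['No', 'fine'],
                    'C': ['Yes', 'No'],
                    'SQL': ['No', 'Yes']})

yes_no_cols = ['C', 'SQL']          # only these columns are converted
codes = {'Yes': 1.0, 'No': 0.0}

for col in yes_no_cols:
    toy[col] = toy[col].map(codes)

print(toy)
```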


### Adding simple columns

Let's add a column that counts how many programming languages each student knows: Languages = C + CPP + CS + Java + Python + JS + R + SQL + SAS.

In [42]:
df['Languages'] = df.C+df.CPP+df.CS+df.Java+df.Python+df.JS+df.R+df.SQL+df.SAS

In [43]:
df.head(3)

Unnamed: 0,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus,Languages
0,MSIS,4,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,4,4,0.0,0,1,0,0,6.0
1,MSIS,3,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,2,2,0.5,0,0,0,1,4.0
2,MSIS,3,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,3,3,0.0,0,0,1,0,3.0
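
One detail worth knowing: adding columns with `+` yields NaN for any row in which at least one answer is missing, which would explain why the `describe()` output later in this notebook reports fewer non-null values for Languages than there are rows. The sketch below, on made-up data with three language columns, contrasts `+` with `DataFrame.sum(axis=1)`, which skips missing values by default.

```python
import numpy as np
import pandas as pd

# Made-up data: the second student did not answer the Python question.
toy = pd.DataFrame({'C': [1.0, 0.0],
                    'Python': [1.0, np.nan],
                    'SQL': [1.0, 1.0]})
lang_cols = ['C', 'Python', 'SQL']

by_plus = toy.C + toy.Python + toy.SQL                     # NaN propagates
by_sum = toy[lang_cols].sum(axis=1)                        # NaN is skipped
by_sum_strict = toy[lang_cols].sum(axis=1, skipna=False)   # same behavior as '+'

print(by_plus.tolist())
print(by_sum.tolist())
```

Which behavior is preferable depends on whether a missing answer should count as "does not know the language" or as "unknown".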


To rearrange column order:<br/>
<b>reindex(columns=[the columns in the order that you want])</b>

In [44]:
df.columns

Index(['Program', 'ProgSkills', 'C', 'CPP', 'CS', 'Java', 'Python', 'JS', 'R',
       'SQL', 'SAS', 'Excel', 'Tableau', 'Regression', 'Classification',
       'Clustering', 'Job', 'Bach_0to1', 'Bach_1to3', 'Bach_3to5',
       'Bach_5Plus', 'Languages'],
      dtype='object')

Let's place <i>Languages</i> first:

In [45]:
list_of_columns = df.columns.tolist()

In [46]:
list_of_columns.insert(0, list_of_columns.pop(-1))

In [47]:
list_of_columns[:3]

['Languages', 'Program', 'ProgSkills']

In [48]:
df = df.reindex(columns=list_of_columns)

In [49]:
df.head(3)

Unnamed: 0,Languages,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus
0,6.0,MSIS,4,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,4,4,0.0,0,1,0,0
1,4.0,MSIS,3,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,2,2,0.5,0,0,0,1
2,3.0,MSIS,3,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,3,3,0.0,0,0,1,0


### Adding complex columns (advanced topic)

Let's add a 0-1 column called "Expert" that is 1 if <i>Languages</i> is 3 or more, and 0 otherwise.

First of all, note that you can do it very easily with what you already know:

In [50]:
df['Languages'] >= 3

0      True
1      True
2      True
3      True
4      True
5     False
6      True
7      True
8     False
9      True
10     True
11    False
12     True
13     True
14     True
15    False
16     True
17    False
18    False
19     True
20    False
21     True
22    False
23     True
24    False
25    False
26    False
27     True
28    False
29     True
      ...  
31    False
32     True
33     True
34     True
35     True
36     True
37    False
38     True
39     True
40     True
41     True
42     True
43     True
44     True
45     True
46     True
47     True
48     True
49     True
50     True
51     True
52     True
53     True
54     True
55     True
56     True
57    False
58     True
59     True
60     True
Name: Languages, Length: 61, dtype: bool

In [51]:
(df['Languages'] >= 3) + 0.0

0     1.0
1     1.0
2     1.0
3     1.0
4     1.0
5     0.0
6     1.0
7     1.0
8     0.0
9     1.0
10    1.0
11    0.0
12    1.0
13    1.0
14    1.0
15    0.0
16    1.0
17    0.0
18    0.0
19    1.0
20    0.0
21    1.0
22    0.0
23    1.0
24    0.0
25    0.0
26    0.0
27    1.0
28    0.0
29    1.0
     ... 
31    0.0
32    1.0
33    1.0
34    1.0
35    1.0
36    1.0
37    0.0
38    1.0
39    1.0
40    1.0
41    1.0
42    1.0
43    1.0
44    1.0
45    1.0
46    1.0
47    1.0
48    1.0
49    1.0
50    1.0
51    1.0
52    1.0
53    1.0
54    1.0
55    1.0
56    1.0
57    0.0
58    1.0
59    1.0
60    1.0
Name: Languages, Length: 61, dtype: float64

In [52]:
%timeit df['Expert1'] = (df['Languages']>= 3) + 0.0

308 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [53]:
df['Expert1'][:10]

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
5    0.0
6    1.0
7    1.0
8    0.0
9    1.0
Name: Expert1, dtype: float64
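
Adding 0.0 works because True and False behave like 1 and 0 in arithmetic. A slightly more explicit way to write the same conversion, sketched here on a made-up Series, is `astype(int)`. Note that a comparison with NaN evaluates to False, so a student with a missing Languages value ends up as 0.

```python
import numpy as np
import pandas as pd

# Made-up Languages values, including one missing value.
languages = pd.Series([6.0, 2.0, np.nan])

is_expert = languages >= 3          # Boolean Series; NaN >= 3 is False
expert = is_expert.astype(int)      # True -> 1, False -> 0

print(expert.tolist())              # [1, 0, 0]
```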

But suppose that the calculation is complicated. First, we need to define a function that, given a row (i.e., a Series that represents a student), returns 1 if the student is an expert (and 0 otherwise). This row has index labels 'Languages', 'Program', 'ProgSkills', etc.

In [54]:
def ExpertFunction(row):
    if row['Languages'] >= 3:
        return 1
    else:
        return 0

Second, we use the function <b>apply</b> with axis=1, which calls the function once for each row and returns a Series:

In [55]:
df['Expert2']=df.apply(ExpertFunction, axis=1)

In [56]:
df.head(3)

Unnamed: 0,Languages,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus,Expert1,Expert2
0,6.0,MSIS,4,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,4,4,0.0,0,1,0,0,1.0,1
1,4.0,MSIS,3,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,2,2,0.5,0,0,0,1,1.0,1
2,3.0,MSIS,3,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,3,3,0.0,0,0,1,0,1.0,1


Alternatively, instead of defining a function with <i>def</i>, we can use a lambda function, which allows us to define the function "inline":

In [57]:
df['Expert3'] = df.apply(lambda row: 1 if row['Languages'] >= 3 \
                                    else 0, axis=1)

In [58]:
df.head(3)

Unnamed: 0,Languages,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus,Expert1,Expert2,Expert3
0,6.0,MSIS,4,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,4,4,0.0,0,1,0,0,1.0,1,1
1,4.0,MSIS,3,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,2,2,0.5,0,0,0,1,1.0,1,1
2,3.0,MSIS,3,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,3,3,0.0,0,0,1,0,1.0,1,1


In [59]:
df.drop(['Expert3','Expert2','Expert1'],axis=1, inplace=True)

In [60]:
df.head(3)

Unnamed: 0,Languages,Program,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus
0,6.0,MSIS,4,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,4,4,0.0,0,1,0,0
1,4.0,MSIS,3,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,2,2,0.5,0,0,0,1
2,3.0,MSIS,3,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,3,3,0.0,0,0,1,0


<b>Warning</b>: <i>DataFrame.apply</i> is slow, especially when it is called on every row (i.e., axis=1), because it runs a Python function once per row. Whenever you can, use operations between Series and scalars instead. In the timings below, the vectorized expression takes about 300 µs per loop, compared with about 1.12 ms for <i>apply</i>:

In [61]:
%timeit df['Expert'] = (df.Languages >= 3) + 0.0

300 µs ± 7.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [62]:
%timeit df['Expert'] = df.apply(lambda row : 1 if row['Languages'] >= 3 \
                                else 0, axis=1)

1.12 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
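
Rules that combine several columns can often stay vectorized too. As a sketch, suppose (hypothetically, this rule is not part of the notebook) that an expert must have both Languages of 3 or more and ProgSkills of 3 or more. `numpy.where` picks between two values element by element based on a Boolean condition:

```python
import numpy as np
import pandas as pd

# Made-up data for three students.
toy = pd.DataFrame({'Languages': [6.0, 4.0, 1.0],
                    'ProgSkills': [4, 2, 5]})

# Each comparison needs its own parentheses because & binds tighter than >=.
condition = (toy.Languages >= 3) & (toy.ProgSkills >= 3)
toy['Expert'] = np.where(condition, 1, 0)

print(toy)
```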


### DataFrame.apply vs Series.apply

In pandas 0.20, DataFrame.apply with axis=1 is slow, whereas Series.apply is much faster. For example, suppose that we want to add a column <i>ProgramLower</i> with the lowercase value of Program.

#### Solution 1: DataFrame.apply (slow! Do not use!)

In [63]:
%timeit df.apply(lambda x: x['Program'].lower(), axis=1)

1.05 ms ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


#### Solution 2: Series.apply (faster, because it avoids building a Series for each row)

In [64]:
%timeit df.Program.apply(lambda y: y.lower())

120 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
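
For string operations, pandas also offers the `.str` accessor. The sketch below, on a made-up Series, shows `Series.str.lower()`. No speed claim is made here, since relative timings depend on the pandas version and the data; the practical advantage shown is that missing values pass through, whereas calling `.lower()` inside a lambda would raise an error on a missing value.

```python
import pandas as pd

# Made-up Program values, including a missing one.
program = pd.Series(['MSIS', 'MBA', None])

program_lower = program.str.lower()   # missing values stay missing

print(program_lower.tolist())
```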


### Write it to a file

In [95]:
df.to_csv('cleaned_survey.csv')
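
By default `to_csv` also writes the row index as a first, unnamed column; when such a file is read back, that column shows up as `Unnamed: 0`. The following sketch demonstrates this on a made-up frame, writing to in-memory buffers instead of files, and shows `index=False` as a way to leave the index out.

```python
import io
import pandas as pd

toy = pd.DataFrame({'Program': ['MSIS', 'MBA'], 'Job': [0.0, 1.0]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)      # ['Unnamed: 0', 'Program', 'Job']
print(cols_without_index)   # ['Program', 'Job']
```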

## Summary Functions

pandas provides many simple "summary functions" which restructure the data in some useful way. 

### .describe()

In [66]:
df.describe()

Unnamed: 0,Languages,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus,Expert
count,56.0,61.0,61.0,61.0,60.0,61.0,60.0,60.0,60.0,60.0,60.0,61.0,61.0,60.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0
mean,3.785714,2.852459,0.590164,0.442623,0.083333,0.737705,0.433333,0.383333,0.2,0.85,0.1,0.95082,0.557377,0.516667,1.868852,1.836066,0.352459,0.032787,0.245902,0.262295,0.459016,0.754098
std,1.744751,0.980409,0.495885,0.500819,0.278718,0.443533,0.499717,0.490301,0.403376,0.360085,0.302532,0.218039,0.500819,0.503939,0.956999,0.986244,0.411747,0.179556,0.434194,0.443533,0.502453,0.434194
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,4.0,3.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,5.0,3.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,2.0,2.0,0.5,0.0,0.0,1.0,1.0,1.0
max,7.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0


This method generates a high-level summary of each column of the DataFrame (or of a single column, when called on a Series). It is type-aware, meaning that its output changes based on the dtype of the input. The output above only makes sense for numerical data; for string data, here's what we get:

In [81]:
df.Program.describe()

count       61
unique       6
top       MSIS
freq        40
Name: Program, dtype: object

To see a list of unique values, we can use the **unique** method:

In [84]:
df.Program.unique()

array(['MSIS', 'Supply Chain Mgmt & Analytics', 'MBA', 'Faculty!',
       'Business Man', 'Master of Finance'], dtype=object)

To see a list of unique values and how often they occur in the dataset, we can use the **value_counts** method:

In [85]:
df.Program.value_counts()

MSIS                             40
MBA                              16
Supply Chain Mgmt & Analytics     2
Master of Finance                 1
Business Man                      1
Faculty!                          1
Name: Program, dtype: int64

In contrast, **count** returns only the number of non-null values in the column:

In [86]:
df.Program.count()

61
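
Because `count` skips missing values, it can differ from the number of rows. A short sketch on a made-up Series with one missing answer:

```python
import numpy as np
import pandas as pd

# Made-up answers with one missing value.
sql = pd.Series([1.0, 0.0, np.nan, 1.0])

n_answers = sql.count()          # non-null values only
n_rows = len(sql)                # all rows, missing or not
n_missing = sql.isna().sum()     # number of missing values

print(n_answers, n_rows, n_missing)   # 3 4 1
```

This is consistent with the `describe()` table above, where some columns report a count of 60 even though the DataFrame has 61 rows.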

**.shape** Return a tuple representing the dimensionality of the DataFrame.

In [92]:
df.shape

(61, 23)

## Problems

How many students know SQL?

In [67]:
(df['SQL'] == 1).sum()

51

In [68]:
df.SQL.sum()

51.0

What's the average programming skill level of MSIS students? How does it compare to that of MBA students?

In [69]:
df[df.Program == 'MSIS'].ProgSkills.mean()
df[df.Program == 'MBA'].ProgSkills.mean()

3.075

2.5
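
When the comparison involves every program rather than just two, `groupby` computes all the group means in one step. The sketch below uses made-up numbers, not the survey results:

```python
import pandas as pd

# Made-up data: two MSIS students and two MBA students.
toy = pd.DataFrame({'Program': ['MSIS', 'MSIS', 'MBA', 'MBA'],
                    'ProgSkills': [4, 3, 2, 3]})

# One mean per distinct value of Program, indexed by the program name.
mean_skills = toy.groupby('Program')['ProgSkills'].mean()

print(mean_skills)
```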

How many students know classification better than clustering? And how many know clustering better than classification?

In [70]:
(df.Classification > df.Clustering).sum()

7

In [71]:
(df.Clustering > df.Classification).sum()

6

Correlation between every pair of numeric columns:

In [73]:
df.corr()

Unnamed: 0,Languages,ProgSkills,C,CPP,CS,Java,Python,JS,R,SQL,SAS,Excel,Tableau,Regression,Classification,Clustering,Job,Bach_0to1,Bach_1to3,Bach_3to5,Bach_5Plus,Expert
Languages,1.0,0.689637,0.526305,0.547554,0.355174,0.596253,0.503834,0.6293,0.111712,0.628206,0.111237,-0.258809,0.17955,0.087101,0.102668,0.107504,-0.036692,-0.08745,0.121601,-0.104506,0.019493,0.751168
ProgSkills,0.689637,1.0,0.559183,0.440713,0.100473,0.522768,0.339692,0.241747,-0.221039,0.505797,0.057335,-0.112476,0.170276,-0.113983,0.174432,0.146937,-0.137397,-0.161415,0.008344,0.013823,0.038271,0.500636
C,0.526305,0.559183,1.0,0.541281,0.123091,0.336652,0.056851,-0.055978,-0.338062,0.307698,0.045361,-0.035378,0.062709,-0.208587,0.095573,0.030727,-0.13783,0.153429,0.011421,0.042237,-0.101983,0.375618
CPP,0.547554,0.440713,0.541281,1.0,0.212121,0.231244,-0.0181,0.251501,-0.269069,0.192336,-0.07817,0.050042,-0.003268,-0.029165,-0.08551,-0.154333,-0.001325,0.021268,0.027642,-0.156213,0.106407,0.278938
CS,0.355174,0.100473,0.123091,0.212121,1.0,0.174078,0.168023,0.268767,-0.145668,0.120517,-0.102383,0.069171,-0.101409,0.055709,-0.084029,-0.128912,-0.031172,-0.055989,-0.034816,-0.181818,0.212121,0.023762
Java,0.596253,0.522768,0.336652,0.231244,0.174078,1.0,0.147043,0.29687,-0.263822,0.48553,-0.06415,-0.135613,0.444037,-0.356987,-0.082393,-0.061836,-0.261071,-0.099493,0.080869,0.186111,-0.198615,0.524942
Python,0.503834,0.339692,0.056851,-0.0181,0.168023,0.147043,1.0,0.260878,0.163203,0.281636,-0.175032,-0.262347,0.047324,0.196751,-0.018747,-0.003431,-0.026029,-0.162386,0.038837,0.00507,0.020282,0.243865
JS,0.6293,0.241747,-0.055978,0.251501,0.268767,0.29687,0.260878,1.0,0.152753,0.24246,-0.148545,-0.290978,0.136048,0.090723,-0.006238,-0.014507,0.054577,-0.146408,0.178122,-0.165371,0.044788,0.272862
R,0.111712,-0.221039,-0.338062,-0.269069,-0.145668,-0.263822,0.163203,0.152753,1.0,-0.136963,0.126903,-0.267652,-0.050252,0.483602,0.236769,0.374065,0.182349,-0.092848,0.096225,-0.113067,0.050252,-0.118217
SQL,0.628206,0.505797,0.307698,0.192336,0.120517,0.48553,0.281636,0.24246,-0.136963,1.0,0.129099,-0.096374,0.276776,-0.120004,0.194014,0.030952,-0.145441,0.078008,0.026948,-0.06333,0.004691,0.651109
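
The table above has no row or column for Program, because correlations are only defined for numeric columns. In the pandas version used here, `corr()` leaves out the string column on its own; newer versions may instead raise an error when they meet a string column. A sketch that should behave the same either way, shown on made-up data, selects the numeric columns explicitly first:

```python
import pandas as pd

# Made-up data with one string column and two numeric columns.
toy = pd.DataFrame({'Program': ['MSIS', 'MBA', 'MSIS', 'MBA'],
                    'Languages': [1.0, 2.0, 3.0, 4.0],
                    'ProgSkills': [1, 2, 2, 4]})

# Keep only numeric columns, then compute the correlation matrix.
numeric_cor = toy.select_dtypes(include='number').corr()

print(numeric_cor)
```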


## Find the strongest correlations

What are the top 10 correlations and the top 10 anti-correlations? 

To answer this question, we need to "stack" the correlation matrix with <i>cor.stack()</i>, which turns it into a Series with a "hierarchical index" (that is, an index with two levels: the row label and the column label).

In [74]:
cor = df.corr()
cor.stack()

Languages   Languages         1.000000
            ProgSkills        0.689637
            C                 0.526305
            CPP               0.547554
            CS                0.355174
            Java              0.596253
            Python            0.503834
            JS                0.629300
            R                 0.111712
            SQL               0.628206
            SAS               0.111237
            Excel            -0.258809
            Tableau           0.179550
            Regression        0.087101
            Classification    0.102668
            Clustering        0.107504
            Job              -0.036692
            Bach_0to1        -0.087450
            Bach_1to3         0.121601
            Bach_3to5        -0.104506
            Bach_5Plus        0.019493
            Expert            0.751168
ProgSkills  Languages         0.689637
            ProgSkills        1.000000
            C                 0.559183
            CPP          

Remove the correlations equal to 1 (they are self-correlations); then keep every second entry, since each pair appears twice, once as (A, B) and once as (B, A).

Use slicing with a step: x[startAt:endBefore:skip]

In [75]:
cor[cor < 1].stack().nlargest(10)

Classification  Clustering        0.859764
Clustering      Classification    0.859764
Languages       Expert            0.751168
Expert          Languages         0.751168
Languages       ProgSkills        0.689637
ProgSkills      Languages         0.689637
SQL             Expert            0.651109
Expert          SQL               0.651109
Languages       JS                0.629300
JS              Languages         0.629300
dtype: float64

In [76]:
cor[cor < 1].stack().nlargest(20)[::2]

Classification  Clustering    0.859764
Languages       Expert        0.751168
                ProgSkills    0.689637
SQL             Expert        0.651109
Languages       JS            0.629300
                SQL           0.628206
                Java          0.596253
ProgSkills      C             0.559183
Languages       CPP           0.547554
C               CPP           0.541281
dtype: float64
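
The `[::2]` slice relies on the two copies of each pair sitting next to each other after sorting, which could fail if two different pairs happened to have exactly the same correlation. As an alternative sketch, on made-up data, one can keep only the part of the matrix strictly above the diagonal before stacking, so that every pair appears once and the self-correlations are gone. The mask built with `numpy.triu` is a technique swapped in here, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Made-up numeric data with three columns, hence three distinct pairs.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [1.0, 3.0, 2.0, 5.0],
                    'c': [4.0, 1.0, 3.0, 2.0]})
cor = toy.corr()

# True strictly above the diagonal, False on and below it.
upper = np.triu(np.ones(cor.shape, dtype=bool), k=1)

# Cells outside the mask become NaN; dropna() removes them after stacking.
pairs = cor.where(upper).stack().dropna()

print(pairs.nlargest(2))
```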

Do the same for negative correlations

In [77]:
cor[cor < 1].stack().nsmallest(20)[::2]

Bach_3to5   Bach_5Plus   -0.549257
Bach_1to3   Bach_5Plus   -0.526004
Tableau     Regression   -0.405591
Excel       Bach_1to3    -0.398272
Job         Bach_3to5    -0.377767
Java        Regression   -0.356987
Bach_1to3   Bach_3to5    -0.340503
C           R            -0.338062
Tableau     Job          -0.321969
Regression  Bach_3to5    -0.321791
dtype: float64

Weakest correlations (closest to zero)


In [78]:
LeastCorr = cor[cor<1].stack().abs().nsmallest(20)[::2]

In [79]:
cor[cor<1].stack()[LeastCorr.index]

CPP             Job              -0.001325
                Tableau          -0.003268
Python          Clustering       -0.003431
Classification  Bach_3to5         0.003862
SQL             Bach_5Plus        0.004691
Python          Bach_3to5         0.005070
Regression      Bach_0to1        -0.006193
JS              Classification   -0.006238
ProgSkills      Bach_1to3         0.008344
C               Bach_1to3         0.011421
dtype: float64