In [1]:
import pandas as pd
df = pd.read_csv('../datasets/ASE-B Survey.csv')

In [2]:
df.shape

(69, 14)

That's a good number of responses to the survey: 69 rows across 14 columns.

In [3]:
df.describe()

Unnamed: 0,Which year are you in currently?,"On a scale of 1-10, how excited do you get when you hear the term ""Data Science"" ?","On a scale of 1-10, how serious are you about a career in Data Science?","On a scale of 1-10, how much do you fear math and statistics?","Would you like to share a link of your data science project? It can be a website or your github repo or a blog article. If yes, kindly paste the link beneath."
count,69.0,69.0,69.0,69.0,0.0
mean,1.913043,8.376812,7.985507,4.072464,
std,0.919385,1.600565,1.866837,2.56285,
min,1.0,3.0,2.0,1.0,
25%,1.0,8.0,7.0,2.0,
50%,2.0,9.0,8.0,3.0,
75%,2.0,10.0,10.0,7.0,
max,4.0,10.0,10.0,10.0,


The last column has a count of 0, so nobody answered that question.  
Let's get rid of it.

In [4]:
del df['Would you like to share a link of your data science project? It can be a website or your github repo or a blog article. If yes, kindly paste the link beneath.']
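As an aside, a column that nobody answered can also be dropped without typing its full name. The sketch below uses a small made-up frame (the `demo` data and the "Project link" name are assumptions for illustration, not the survey itself).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the survey: one answered column, one empty column.
demo = pd.DataFrame({
    "Year": [1, 2, 2],
    "Project link": [np.nan, np.nan, np.nan],
})

# how="all" drops only the columns in which every single value is missing.
demo_clean = demo.dropna(axis=1, how="all")
print(list(demo_clean.columns))
```

Columns with only some missing answers are kept, so partially answered questions survive.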

Are all the column names that huge?

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 13 columns):
Timestamp                                                                                                                                                                     69 non-null object
Which year are you in currently?                                                                                                                                              69 non-null int64
On a scale of 1-10, how excited do you get when you hear the term "Data Science" ?                                                                                            69 non-null int64
Which of the following options best describes you during these holidays?                                                                                                      69 non-null object
If you are learning Data Science currently, kindly mention the sources from which you are learning(Separate each source with a 

The column names are long and impossible to read.  
Let's rename them for a start.

In [6]:
old_columns = df.columns

new_columns = []  
for i in old_columns:  
    new_columns.append(str(input('What will you replace '+i+' with : ')))  
  
I used this snippet to simplify renaming the long column names: it prompts for a replacement for each one.  
It saved a lot of scrolling around.  
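A non-interactive alternative is to keep the old-to-new names in a dictionary and pass it to `rename`. This is only a sketch on an empty stand-in frame (`demo` and `mapping` are names made up here). The two long questions are copied from the survey header.

```python
import pandas as pd

# Stand-in frame with two of the survey's original column names.
demo = pd.DataFrame(columns=[
    "Which year are you in currently?",
    "On a scale of 1-10, how much do you fear math and statistics?",
])

# Explicit old -> new mapping; columns not listed here are left unchanged.
mapping = {
    "Which year are you in currently?": "Year",
    "On a scale of 1-10, how much do you fear math and statistics?": "Math_stat_fear",
}
demo = demo.rename(columns=mapping)
print(list(demo.columns))
```

Unlike assigning a whole list to `df.columns`, a mapping does not depend on the column order.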

In [7]:
new_columns = ['Timestamp',
 'Year',
 'Excitement_scale',
 'Holidays',
 'Learning_sources(DS)',
 'Learning_sources(Python)',
 'SIG_Aware',
 'Winter_codex_aware',
 'Learning_preference',
 'Career_seriousness_scale',
 'Math_stat_fear',
 'Projects',
 'E-mail']

In [8]:
df.columns = new_columns

Column names were successfully renamed.  
So, let's take another look at the quick stats.

In [9]:
df.describe()

Unnamed: 0,Year,Excitement_scale,Career_seriousness_scale,Math_stat_fear
count,69.0,69.0,69.0,69.0
mean,1.913043,8.376812,7.985507,4.072464
std,0.919385,1.600565,1.866837,2.56285
min,1.0,3.0,2.0,1.0
25%,1.0,8.0,7.0,2.0
50%,2.0,9.0,8.0,3.0
75%,2.0,10.0,10.0,7.0
max,4.0,10.0,10.0,10.0


The Year attribute stores categorical data, so statistics such as its mean and standard deviation are not meaningful.  
So let's change the data points to Roman numerals.

In [10]:
def Year_rename(year):
    # Years 1-3 become 'I', 'II', 'III'; year 4 is the special case 'IV'.
    if int(year) == 4:
        return 'IV'
    return 'I' * int(year)

df['Year'] = df['Year'].apply(Year_rename)
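Plain strings sort alphabetically, which may not matter here. If the year order is needed later, an ordered categorical is one option. The sketch below uses made-up values (`years`, `roman` and `year_cat` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Hypothetical year values in the same 1-4 range as the survey.
years = pd.Series([2, 1, 4, 3])

# Map each year to its Roman numeral, then declare the intended order.
roman = {1: "I", 2: "II", 3: "III", 4: "IV"}
year_cat = pd.Series(
    pd.Categorical(
        years.map(roman),
        categories=["I", "II", "III", "IV"],
        ordered=True,
    )
)

# Sorting now follows the category order rather than the alphabet.
print(year_cat.sort_values().tolist())
```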

That's done, so let's take a look at the head.

In [11]:
df.head()

Unnamed: 0,Timestamp,Year,Excitement_scale,Holidays,Learning_sources(DS),Learning_sources(Python),SIG_Aware,Winter_codex_aware,Learning_preference,Career_seriousness_scale,Math_stat_fear,Projects,E-mail
0,11/5/2019 21:51:27,II,10,Working on Data Science Projects,"Udemy , blog posts.","Codecademy, Solo Learn, Whatever I can find on...",Yes,Yes,Online Courses,10,2,No,ritwiklal2000@gmail.com
1,11/5/2019 21:53:49,II,9,Learning Data Science,I watch YouTube videos because I like visual s...,Whatever I can find on Google,Yes,No,Online Courses,8,3,No,meghanarao.99@gmail.com
2,11/5/2019 21:55:07,II,10,I am stuck with other work and have been unabl...,,"Whatever I can find on Google, The In-house ""W...",Yes,Yes,Project-based,10,1,No,
3,11/5/2019 21:55:16,I,10,Learning Data Science,"Sig, online books, orielly","The In-house ""Winter CodeX 2018-Python Edition...",Yes,Yes,Book-based,10,2,Yes,adityadoranala2@gmail.com
4,11/5/2019 21:55:24,I,10,Learning Python,"Udacity,udemy","Real Python, The In-house ""Winter CodeX 2018-P...",Yes,Yes,Online Courses,8,4,No,snehabalaji74@gmail.com


Timestamp is still stored as a string (object dtype), so its date and time components are not accessible. It needs converting.

In [12]:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

Converting the timestamp to the datetime datatype unlocks a lot of functionality, such as the .dt accessor used next.
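The raw strings look like `11/5/2019 21:51:27`, which could be read as month-first or day-first. Passing an explicit format removes that ambiguity. The sketch assumes month-first. That is an inference from the survey dates, not something the source states.

```python
import pandas as pd

# Two timestamps in the same layout as the raw survey column.
raw = pd.Series(["11/5/2019 21:51:27", "11/9/2019 18:24:47"])

# Assumption: month comes first, i.e. these are 5 and 9 November 2019.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S")
print(parsed.dt.day.tolist())
```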

In [13]:
df['time'] = df['Timestamp'].dt.hour + df['Timestamp'].dt.minute/100
df['day'] = df['Timestamp'].dt.day

Here, we store the time of day as a number in HH.MM form (for example, 22:02 becomes 22.02), along with the day of the month, for ease of plotting later on.  
Now that that's done, let's look at the tail this time.

In [14]:
df.tail()

Unnamed: 0,Timestamp,Year,Excitement_scale,Holidays,Learning_sources(DS),Learning_sources(Python),SIG_Aware,Winter_codex_aware,Learning_preference,Career_seriousness_scale,Math_stat_fear,Projects,E-mail,time,day
64,2019-11-08 22:02:47,III,8,Learning Python,"W3schools, sap course",Whatever I can find on Google,No,No,Project-based,8,5,No,mallikakhan9000@gmail.com,22.02,8
65,2019-11-09 18:18:36,IV,8,I am stuck with other work and have been unabl...,,Whatever I can find on Google,No,No,Online Courses,10,7,No,,18.18,9
66,2019-11-09 18:22:57,IV,8,I am stuck with other work and have been unabl...,,Whatever I can find on Google,No,No,Online Courses,8,7,No,,18.22,9
67,2019-11-09 18:24:40,IV,10,Learning Data Science,Coursera,Whatever I can find on Google,No,Yes,Online Courses,10,1,Yes,srikartondapu13@gmail.com,18.24,9
68,2019-11-09 18:24:47,IV,9,Learning Data Science,"Course era, gcp","Solo Learn, Whatever I can find on Google","Heard of it, but I thought it was like a ""Unic...",Yes,Online Courses,9,3,No,naveenkumar.nattanmy@gmail.com,18.24,9
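Note that the time column is an HH.MM label (18.24 means 18:24), not a fraction of an hour, so values never fall between .59 and the next whole hour. If an evenly spaced axis is wanted, dividing the minutes by 60 is one possible alternative. The sketch uses two timestamps from the tail above. The names `hhmm` and `hour_frac` are made up here.

```python
import pandas as pd

ts = pd.to_datetime(pd.Series(["2019-11-09 18:24:40", "2019-11-08 22:02:47"]))

# The notebook's encoding: minutes appear as the two decimal digits.
hhmm = ts.dt.hour + ts.dt.minute / 100

# Alternative: true fractional hours, so 18:30 would map to 18.5.
hour_frac = ts.dt.hour + ts.dt.minute / 60

print(hhmm.round(2).tolist())
print(hour_frac.round(2).tolist())
```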


Respondents listed a wide range of learning sources, often several in one answer. So, the column needs to be split into individual entries for later analysis.

In [15]:
df['Learning_sources(Python)'].value_counts(dropna=False)

Whatever I can find on Google                                                                                              18
NaN                                                                                                                        10
Whatever I can find on Google, The In-house "Winter CodeX 2018-Python Edition" beginner's guide                             4
Solo Learn, Whatever I can find on Google                                                                                   4
Udemy                                                                                                                       2
Solo Learn, Whatever I can find on Google, The In-house "Winter CodeX 2018-Python Edition" beginner's guide                 2
Coursera                                                                                                                    1
Hacker rank,Mosh hamedami                                                                                             

A good number of null values here suggests that not all the students have learnt Python.  
Let's focus on who responded.

In [16]:
# Collect every comma-separated source into one flat list, skipping unanswered rows.
py = []
for entry in df['Learning_sources(Python)'].dropna():
    py += str(entry).split(',')

In [17]:
# Normalise each entry: trim whitespace, then capitalise only the first letter.
# capitalize() already lower-cases the rest of the string.
py = [source.strip().capitalize() for source in py]

In [18]:
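py.sort()
# The positions below are hard-coded for this dataset's sorted list;
# the entries at these positions are grouped together as 'Others'.
for i in range(4,13):
    py[i] = 'Others'
py[48] = 'Others'
py[57] = 'Others'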
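Hard-coded positions stop working as soon as a response is added. One possible alternative is to bucket by frequency. The sketch below is not what the notebook does: the sample values and the `min_count` threshold are assumptions for illustration.

```python
import pandas as pd

# Hypothetical cleaned source names.
sources = pd.Series(
    ["Solo learn", "Solo learn", "Udemy", "Solo learn", "Coursera", "Udemy"]
)

min_count = 2  # assumed threshold: anything rarer becomes 'Others'
counts = sources.value_counts()
rare = counts[counts < min_count].index

# Keep frequent values as they are and replace the rare ones.
bucketed = sources.where(~sources.isin(rare), "Others")
print(bucketed.tolist())
```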

In [19]:
for i in range(len(py)):
    if py[i] == 'The in-house "winter codex 2018-python edition" beginner\'s guide':
        py[i] = 'Winter Codex'
    if py[i] == 'Whatever i can find on google':
        py[i] = 'Google'

In [20]:
Py = pd.DataFrame(py)

The attribute had multiple entries separated by commas. So, I made a list of the
separated values, which I later converted to a pandas DataFrame so that pandas functions
can be used on them. This attribute is now clean enough to work on.
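The same split-and-normalise steps can probably be written with pandas string methods alone. This is a sketch on made-up answers (`raw`, `sources` and `counts` are illustrative names), not a drop-in replacement for the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical answers in the same comma-separated style as the survey.
raw = pd.Series([
    "Codecademy, Solo Learn",
    "Whatever I can find on Google",
    np.nan,
    "Solo Learn, Whatever I can find on Google",
])

sources = (
    raw.dropna()          # skip respondents who left the question blank
       .str.split(",")    # one list of sources per respondent
       .explode()         # one row per individual source
       .str.strip()       # remove the spaces left around the commas
       .str.capitalize()  # first letter upper case, the rest lower case
)
counts = sources.value_counts()
print(counts)
```

Here `explode` does the unstacking, and `value_counts` gives the tally directly.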

In [21]:
e = []
for i in range(69):
    if str(df['E-mail'][i]) == 'nan':
        e.append('E-Mail ID not given')
    else:
        e.append('E-Mail ID given')

In [22]:
df['SS'] = pd.DataFrame(e)

We make a new attribute, SS, that records whether an e-mail ID was given or not.
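For reference, the same flag can be built without a loop or a fixed row count. The sketch below uses placeholder addresses (the `demo` frame is made up), with the same two labels as above.

```python
import numpy as np
import pandas as pd

# Placeholder data: two respondents gave an address, one did not.
demo = pd.DataFrame({"E-mail": ["a@example.com", np.nan, "b@example.com"]})

# notna() is True where an address exists; np.where picks the matching label.
demo["SS"] = np.where(
    demo["E-mail"].notna(), "E-Mail ID given", "E-Mail ID not given"
)
print(demo["SS"].tolist())
```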

In [23]:
Py.to_csv('Py_learn')

In [24]:
df.to_csv('Clean_survey')

So, now that all the cleaning is done, we export the dataset as Clean_survey for later use.  
The cleaning part is kept separate so that only interested readers need to view it.
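One thing to keep in mind when reloading: `to_csv` writes the row index as an unnamed first column by default, and `read_csv` brings it back as `Unnamed: 0`. Passing `index=False` avoids that. The sketch below round-trips a tiny made-up frame through an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the cleaned survey.
demo = pd.DataFrame({"Year": ["I", "II"], "Math_stat_fear": [2, 3]})

# Write without the index, then read the text straight back.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))
```

Giving the output file a .csv extension would also make it easier to recognise later.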

### That's pretty much it for re-cleaning this survey data