# Python for Data Science Practice Session 1: Mathematics and Statistics

## A walk through history

In this session, we are going to analyse a dataset of mathematicians who have helped shape what mathematics, as well as many other fields, has become today. This is a great chance for you to learn more about them: where they came from, the fields they were interested in, the occupations they held and many other things that you can explore.

We start by importing the libraries that we are going to need for this project:

In [3]:
# Import pandas  
import pandas as pd

In [4]:
# itertools is used in one of the pre-written code cells below
import itertools

In [6]:
# Load the file named mathematicians.csv into a dataframe
df = pd.read_csv('mathematicians.csv')

In [7]:
# View the dataframe
df

Unnamed: 0,mathematicians,occupation,country of citizenship,place of birth,date of death,educated at,employer,place of death,member of,employer.1,...,instance of,sex or gender,approx. date of birth,day of birth,month of birth,year of birth,approx. date of death,day of death,month of death,year of death
0,Roger Joseph Boscovich,"['physicist', 'astronomer', 'mathematician', '...",['Republic of Ragusa'],"Dubrovnik, Republic of Ragusa",13 February 1787,['Pontifical Gregorian University'],['Pontifical Gregorian University'],"['Milan', 'Habsburg Empire']","['Royal Society', 'Russian Academy of Sciences...",['Pontifical Gregorian University'],...,['human'],['male'],False,18.0,May,1711,False,13.0,February,1787
1,Emma Previato,['mathematician'],"['United States of America', 'Italy']",Badia Polesine,,"['Harvard University', 'University of Padua']","['Boston University', 'University of Padua']",,['American Mathematical Society'],"['Boston University', 'University of Padua']",...,['human'],['female'],False,,,1952,False,,,
2,Feodor Deahna,['mathematician'],,,1844,,,,,,...,['human'],['male'],False,,,1815,False,,,1844
3,Denis Henrion,"['publisher', 'mathematician']",['France'],,1640,,,,,,...,['human'],['male'],True,,,1500,False,,,1640
4,Henri Delannoy,"['mathematician', 'historian', 'military perso...",['France'],Bourbonne-les-Bains,5 February 1915,['École Polytechnique'],,['Guéret'],,,...,['human'],['male'],False,28.0,September,1833,False,5.0,February,1915
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8591,Hugh Hamilton (bishop),['priest'],,County Dublin,1 December 1805,['Trinity College Dublin'],,['Kilkenny'],['Royal Society'],,...,['human'],['male'],False,26.0,March,1729,False,1.0,December,1805
8592,Boáz Klartag,"['mathematician', 'university teacher']",['Israel'],,,['Tel Aviv University'],,,,,...,['human'],['male'],False,25.0,April,1978,False,,,
8593,Alain M. Robert,['mathematician'],['Switzerland'],,,,,,,,...,['human'],['male'],False,15.0,October,1941,False,,,
8594,Paul C. Rosenbloom,['mathematician'],['United States of America'],Portsmouth,May 2005,,,,,,...,['human'],['male'],False,31.0,March,1920,False,,May,2005


You might be wondering what NaN stands for. NaN is an abbreviation for "Not a Number": it is a special value of the floating-point numeric type that represents an undefined or unrepresentable result, such as 0/0. pandas treats NaNs as missing values.
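To make the behaviour of NaN concrete, here is a minimal sketch. The Series `s` is a made-up example, not part of the dataset. It shows that NaN is a float that does not compare equal to itself, and that pandas flags it as missing and skips it by default when summing:

```python
import math
import pandas as pd

nan = float("nan")

# NaN is a float value that compares unequal to everything, even itself
is_equal_to_itself = nan == nan
print(type(nan), is_equal_to_itself, math.isnan(nan))

# pandas flags NaN as missing, and skips it by default in reductions such as sum
s = pd.Series([1.0, nan, 3.0])
missing_mask = s.isna().tolist()
total = s.sum()
print(missing_mask, total)
```

Because NaN never equals itself, missing values should be detected with <b>isna</b> rather than with `==`.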

This dataset is a great example of what real-life datasets look like: they are rarely perfect, and you will have to find a way to deal with missing values.

For now, let's get a taste of what the dataset looks like; we'll get back to dealing with missing values in a bit.

In [8]:
# Output the dimensions of the dataframe
df.shape

(8596, 29)

In [9]:
# Output the data of 10 random mathematicians
df.sample(10)

Unnamed: 0,mathematicians,occupation,country of citizenship,place of birth,date of death,educated at,employer,place of death,member of,employer.1,...,instance of,sex or gender,approx. date of birth,day of birth,month of birth,year of birth,approx. date of death,day of death,month of death,year of death
4945,Richard M. Goodwin,"['mathematician', 'economist']",['United States of America'],Indiana,13 August 1996,['Harvard University'],['Harvard University'],['Siena'],,['Harvard University'],...,['human'],['male'],False,24.0,February,1913.0,False,13.0,August,1996.0
1641,Anthony Bartholomay,[''],,,21 March 1975,,,,,,...,['human'],['male'],False,11.0,August,1919.0,False,21.0,March,1975.0
1435,Arthur Burks,['mathematician'],['United States of America'],Duluth,14 May 2008,"['University of Michigan', 'DePauw University']",['University of Michigan'],['Ann Arbor'],,['University of Michigan'],...,['Q5'],['male'],False,13.0,October,1915.0,False,14.0,May,2008.0
814,Stan van Hoesel,['mathematician'],,,,,['Maastricht University'],,,['Maastricht University'],...,['human'],['male'],False,,,1961.0,False,,,
4478,Yuri Manin,"['mathematician', 'university teacher']","['Soviet Union', 'Germany']","Simferopol, Russian Soviet Federative Socialis...",,['Moscow State University'],"['Northwestern University', '2002']",,"['German Academy of Sciences Leopoldina', 'Fre...","['Northwestern University', '2002']",...,['human'],['male'],False,16.0,February,1937.0,False,,,
4545,Harold W. Kuhn,"['mathematician', 'writer', 'university teache...",['United States of America'],Santa Monica,2 July 2014,,['Princeton University'],['New York City'],['American Academy of Arts and Sciences'],['Princeton University'],...,['human'],['male'],False,29.0,July,1925.0,False,2.0,July,2014.0
6659,Michael Sadowsky,['mathematician'],,,31 December 1967,,,,,,...,['human'],['male'],False,,,1902.0,False,31.0,December,1967.0
6144,André Haefliger,"['mathematician', 'university teacher']",['Switzerland'],Nyon,,['University of Strasbourg'],,,,,...,['human'],['male'],False,22.0,May,1929.0,False,,,
5277,C. P. Ramanujam,['mathematician'],['India'],Chennai,27 October 1974,['University of Madras'],,['Bangalore'],,,...,['human'],['male'],False,22.0,December,1887.0,False,27.0,October,1974.0
7460,S. L. Hakimi,['mathematician'],['United States of America'],,2005,"['University of Illinois system', 'University ...",['Northwestern University'],,,['Northwestern University'],...,['human'],['male'],False,,,,False,,,2005.0


Not all the column headings are visible in the previous output because of the large number of columns.

Let us output the column headings:

In [10]:
# View the column headings
df.columns.values.tolist()

['mathematicians',
 'occupation',
 'country of citizenship',
 'place of birth',
 'date of death',
 'educated at',
 'employer',
 'place of death',
 'member of',
 'employer.1',
 'doctoral advisor',
 'languages spoken, written or signed',
 'academic degree',
 'doctoral student',
 'manner of death',
 'position held',
 'field of work',
 'award received',
 'Erdős number',
 'instance of',
 'sex or gender',
 'approx. date of birth',
 'day of birth',
 'month of birth',
 'year of birth',
 'approx. date of death',
 'day of death',
 'month of death',
 'year of death']

In [11]:
# Drop the following columns: 'Erdős number' , 'instance of', 'approx. date of birth' and 'approx. date of death'
df.drop(['Erdős number', 'instance of', 'approx. date of birth', 'approx. date of death'], axis=1)

Unnamed: 0,mathematicians,occupation,country of citizenship,place of birth,date of death,educated at,employer,place of death,member of,employer.1,...,position held,field of work,award received,sex or gender,day of birth,month of birth,year of birth,day of death,month of death,year of death
0,Roger Joseph Boscovich,"['physicist', 'astronomer', 'mathematician', '...",['Republic of Ragusa'],"Dubrovnik, Republic of Ragusa",13 February 1787,['Pontifical Gregorian University'],['Pontifical Gregorian University'],"['Milan', 'Habsburg Empire']","['Royal Society', 'Russian Academy of Sciences...",['Pontifical Gregorian University'],...,,,['Fellow of the Royal Society'],['male'],18.0,May,1711,13.0,February,1787
1,Emma Previato,['mathematician'],"['United States of America', 'Italy']",Badia Polesine,,"['Harvard University', 'University of Padua']","['Boston University', 'University of Padua']",,['American Mathematical Society'],"['Boston University', 'University of Padua']",...,,,,['female'],,,1952,,,
2,Feodor Deahna,['mathematician'],,,1844,,,,,,...,,['differential geometry'],,['male'],,,1815,,,1844
3,Denis Henrion,"['publisher', 'mathematician']",['France'],,1640,,,,,,...,,,,['male'],,,1500,,,1640
4,Henri Delannoy,"['mathematician', 'historian', 'military perso...",['France'],Bourbonne-les-Bains,5 February 1915,['École Polytechnique'],,['Guéret'],,,...,,,,['male'],28.0,September,1833,5.0,February,1915
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8591,Hugh Hamilton (bishop),['priest'],,County Dublin,1 December 1805,['Trinity College Dublin'],,['Kilkenny'],['Royal Society'],,...,,,['Fellow of the Royal Society'],['male'],26.0,March,1729,1.0,December,1805
8592,Boáz Klartag,"['mathematician', 'university teacher']",['Israel'],,,['Tel Aviv University'],,,,,...,,,"['Salem Prize', '2008']",['male'],25.0,April,1978,,,
8593,Alain M. Robert,['mathematician'],['Switzerland'],,,,,,,,...,,,,['male'],15.0,October,1941,,,
8594,Paul C. Rosenbloom,['mathematician'],['United States of America'],Portsmouth,May 2005,,,,,,...,,,['John Simon Guggenheim Memorial Foundation Fe...,['male'],31.0,March,1920,,May,2005


In [12]:
# Output the number of columns
#len(df.columns.values.tolist())
df.columns.size

29
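As a small illustration of how <b>drop</b> behaves, here is a sketch on a toy dataframe (the column names are made up for the example). Calling <b>drop</b> without assigning the result leaves the original dataframe untouched, which would explain a column count that does not change:

```python
import pandas as pd

# Toy dataframe with made-up column names
toy = pd.DataFrame({"name": ["a", "b"], "keep": [1, 2], "unwanted": [3, 4]})

# Without assignment, the returned dataframe is discarded and `toy` is unchanged
toy.drop(["unwanted"], axis=1)
columns_before = toy.columns.size

# Assigning the result back makes the change stick
toy = toy.drop(columns=["unwanted"])
columns_after = toy.columns.size

print(columns_before, columns_after)
```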

- - - - 

## Let's talk about missing values

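The count is still 29 rather than 25: <b>drop</b> returns a new dataframe, and the result of the earlier call was not assigned back to df, so the four columns are still present. Let's check how many missing values each column has individually.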

In [13]:
# Output the total number of missing values per column
df.isna().sum()

mathematicians                            0
occupation                                0
country of citizenship                 1721
place of birth                         3026
date of death                          3702
educated at                            3869
employer                               5086
place of death                         5436
member of                              5505
employer.1                             5086
doctoral advisor                       6145
languages spoken, written or signed    6269
academic degree                        8097
doctoral student                       8292
manner of death                        8373
position held                          8377
field of work                          6163
award received                         6249
Erdős number                           7122
instance of                               0
sex or gender                            28
approx. date of birth                     0
day of birth                    

 - - - - - -
As you can see, most of the columns have a large number of missing values, which is an issue. After all, an analysis is only as good as the data behind it. In our case, the large number of missing values per column means that the analysis we do might not be of the best quality.
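Raw counts are easier to judge as a share of the rows. A minimal sketch on a toy dataframe (the values are invented for illustration) that turns the missing-value counts into percentages per column:

```python
import numpy as np
import pandas as pd

# Toy data with invented values; only the column names echo the real dataset
toy = pd.DataFrame({
    "mathematicians": ["A", "B", "C", "D"],
    "place of birth": ["X", np.nan, np.nan, "Y"],
    "manner of death": [np.nan, np.nan, np.nan, "natural causes"],
})

# isna() gives True/False per cell; the mean of booleans is the fraction of True values
missing_pct = toy.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

The same expression applied to df would give the percentage of missing values for each of its columns.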

One problem we could face is a pattern in the missing values. For example, older mathematicians might have less information associated with them than more recent ones (I will leave it to you to check whether this is the case here, if you are interested). Such a pattern could introduce bias: most of the data behind your results would come from recent mathematicians, so the results would not represent older mathematicians well.
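As a possible starting point for that exploration, here is a sketch on toy data (all values are invented, and the helper columns `n_missing` and `century` are names chosen for this example). It counts the missing cells in each row and averages them per century of birth:

```python
import numpy as np
import pandas as pd

# Toy data with invented values
toy = pd.DataFrame({
    "year of birth": [1711, 1750, 1952, 1978],
    "place of birth": [np.nan, np.nan, "X", "Y"],
    "employer": [np.nan, "U", "V", np.nan],
})

# Number of missing cells in each row
toy["n_missing"] = toy.isna().sum(axis=1)

# Bucket the rows by century of birth and average the missing counts
toy["century"] = (toy["year of birth"] // 100) * 100
mean_missing = toy.groupby("century")["n_missing"].mean()
print(mean_missing)
```

On the real dataset this only works once 'year of birth' is numeric, which is handled later in this notebook.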

What we would like to have is a dataset free of missing values that covers a broad variety of mathematicians.
- - - - - - - - - - - - - - -
In the next task, I would like you to pick a random mathematician and fill in their missing values by doing some quick research.


In [12]:
# Follow these steps:

# 1) Pick a random mathematician 
# 2) Do some quick research on them
# 3) Fill in their missing data manually in the dataframe
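
For step 3, one way to write a single value into a dataframe is <b>loc</b> with a row condition and a column label. A minimal sketch on a toy dataframe, where the names and the filled-in value are placeholders rather than real data:

```python
import numpy as np
import pandas as pd

# Toy data: names and places are placeholders, not real records
toy = pd.DataFrame({
    "mathematicians": ["Mathematician A", "Mathematician B"],
    "place of birth": ["Town X", np.nan],
})

# Select the row(s) by name and the column by label, then assign the researched value
is_target = toy["mathematicians"] == "Mathematician B"
toy.loc[is_target, "place of birth"] = "Town Y"

remaining_missing = int(toy["place of birth"].isna().sum())
print(toy)
```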







In a real-world scenario, we would not fill in each missing value manually, as it is extremely inefficient.

A solution to this is web scraping: a technique for automatically collecting data from the internet using Python libraries, one of the most popular being Beautiful Soup. It is extremely powerful because it can collect large amounts of data in a single run, saving the massive amount of time needed to collect the data manually.

Even though Web Scraping is beyond the scope of this course, I would recommend reading more about it on your own. You can find more about Web Scraping and Beautiful Soup here: https://realpython.com/beautiful-soup-web-scraper-python/ 

Keep in mind that there are different types of missing values (e.g. numeric missing values) and different ways of dealing with them. We will return to this topic in the upcoming sessions.
- - - - -

----


## List of occupations

In [14]:
# View the occupation column
df[['occupation']]

Unnamed: 0,occupation
0,"['physicist', 'astronomer', 'mathematician', '..."
1,['mathematician']
2,['mathematician']
3,"['publisher', 'mathematician']"
4,"['mathematician', 'historian', 'military perso..."
...,...
8591,['priest']
8592,"['mathematician', 'university teacher']"
8593,['mathematician']
8594,['mathematician']


As you can see, many of the mathematicians had multiple occupations. Let us analyse the different occupations in the dataset.

In [17]:
# Save the unique entries of the occupation column in an array called unique_occ
unique_occ = df.occupation.unique()

In [18]:
# View unique_occ
unique_occ

array(["['physicist', 'astronomer', 'mathematician', 'philosopher', 'diplomat', 'poet', 'theologian', 'priest', 'polymath', 'historian', 'scientist', 'writer', 'cleric', 'university teacher']",
       "['mathematician']", "['publisher', 'mathematician']", ...,
       "['politician', 'physicist', 'mathematician', 'nuclear scientist']",
       "['mathematician', 'astronomer', 'philosopher', 'astrologer']",
       "['mathematician', 'professeur des universités', 'researcher']"],
      dtype=object)

The occupation "mathematician" appears in both ['mathematician'] and ['publisher', 'mathematician'] in unique_occ. This is because each entry is stored as a single string, and the two strings are different, so they are treated as distinct values. This renders the <b>unique()</b> function useless for outputting a list of the unique occupations.
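The same effect can be reproduced on a tiny made-up Series. In this sketch <b>unique()</b> compares whole strings, so it reports two distinct entries even though both mention 'mathematician':

```python
import pandas as pd

# Each entry is one string that merely looks like a list
occupation = pd.Series([
    "['mathematician']",
    "['publisher', 'mathematician']",
    "['mathematician']",
])

unique_entries = occupation.unique()
print(len(unique_entries))
print(unique_entries)
```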

In the following tasks, we will work towards outputting a list of the unique occupations.

In [19]:
# This code creates a list called flat_list: it takes each entry in unique_occ, splits it on commas
# and puts each occupation in its own entry
item_list = []
for item in unique_occ:
    item = item.split(",")
    item_list.append(item)
flat_list = itertools.chain(*item_list)
flat_list = list(flat_list)

In [21]:
# Print the first 30 entries of flat_list
flat_list[0:30]

["['physicist'",
 " 'astronomer'",
 " 'mathematician'",
 " 'philosopher'",
 " 'diplomat'",
 " 'poet'",
 " 'theologian'",
 " 'priest'",
 " 'polymath'",
 " 'historian'",
 " 'scientist'",
 " 'writer'",
 " 'cleric'",
 " 'university teacher']",
 "['mathematician']",
 "['publisher'",
 " 'mathematician']",
 "['mathematician'",
 " 'historian'",
 " 'military personnel']",
 "['mathematician'",
 " 'university teacher']",
 "['']",
 "['mathematician'",
 " 'theoretical physicist'",
 " 'inventor'",
 " 'academic'",
 " 'non-fiction writer'",
 " 'physicist']",
 "['civil engineer'"]

In [22]:
# Compare the length of unique_occ with the length of flat_list
print("Length of unique_occ is", len(unique_occ), "while length of flat_list is", len(flat_list) )

Length of unique_occ is 1443 while length of flat_list is 5289


As you can see in the first 30 entries of flat_list, the format still has a few issues. For example, we have 'mathematician', ['mathematician'] and 'mathematician'] which all represent the same thing but are written in different ways. They are going to be treated by Python as distinct entries when trying to filter out the distinct values from the list.

To solve this, we will loop over the entries in flat_list and use a function called <b>replace</b> that will help remove unwanted characters in order for all of the entries to have a unified format. Read more about how to use it here: https://www.w3schools.com/python/ref_string_replace.asp 

I would recommend solving this issue by removing one unwanted character at a time: check how the output looks afterwards, identify the next character that needs to be removed, and repeat the process until all the entries have a unified format. This makes the process much simpler.

One thing to keep in mind is that Python string comparisons are sensitive to case and whitespace. So, 'Mathematician' isn't the same as 'mathematician', and '[' is not the same as ' ['.


<b>HINT:</b> Passing the empty string '' as the second argument of <b>replace</b> removes the character instead of replacing it with another character. (Passing ' ', with a space between the quotes, would replace it with a space.)

In [30]:
# Loop through flat_list and use replace to remove unwanted characters, then save every new entry
# in the new list called flat_list2
flat_list2 = []
for s in flat_list:
    s = s.replace(']', '')
    s = s.replace('[', '')
    s = s.replace(" '", '')
    s = s.replace("'",'')
    flat_list2.append(s)
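
The effect of each <b>replace</b> call, and of the hint about the empty string, can be checked on a single entry. A minimal sketch using one string taken from flat_list:

```python
entry = " 'mathematician']"

# An empty string as the second argument deletes every occurrence of the first
cleaned = entry.replace("]", "").replace(" '", "").replace("'", "")

# A space as the second argument leaves a space behind instead
with_space = entry.replace("]", " ")

print(repr(cleaned))
print(repr(with_space))

# Comparisons are sensitive to case and to whitespace
print("Mathematician" == "mathematician", "[" == " [")
```

Using `repr` when printing makes leading and trailing spaces visible, which helps when hunting for the next character to remove.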

In [31]:
# Print out the first 30 entries of flat_list2
flat_list2[0:30]

['physicist',
 'astronomer',
 'mathematician',
 'philosopher',
 'diplomat',
 'poet',
 'theologian',
 'priest',
 'polymath',
 'historian',
 'scientist',
 'writer',
 'cleric',
 'university teacher',
 'mathematician',
 'publisher',
 'mathematician',
 'mathematician',
 'historian',
 'military personnel',
 'mathematician',
 'university teacher',
 '',
 'mathematician',
 'theoretical physicist',
 'inventor',
 'academic',
 'non-fiction writer',
 'physicist',
 'civil engineer']

Problem solved! Now use the function called <b>set</b> to retrieve the unique occupations from flat_list2. Read more about how to use it here: https://www.geeksforgeeks.org/python-set-method/
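
As a quick illustration of what <b>set</b> does, here is a sketch on a short hand-written list. Duplicates collapse into a single element, and because a set has no fixed order, <b>sorted</b> is used to get a stable listing:

```python
occupations = ["mathematician", "publisher", "mathematician", "historian", "mathematician"]

# A set keeps each distinct value exactly once
unique_occupations = set(occupations)

print(len(occupations), len(unique_occupations))
print(sorted(unique_occupations))
```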

In [34]:
# Retrieve the unique occupations from flat_list2
flat_list_unique = set(flat_list2)

In [35]:
# Print out the length of the list of unique occupations
len(flat_list_unique)

414

In [37]:
# Print out the list of unique occupations
flat_list_unique

{'',
 '"childrens writer"',
 '"marja"',
 '16112',
 '1803',
 '1804',
 '1835',
 '1841',
 '1868',
 '1874',
 '1893',
 '1931',
 '1939',
 '1981',
 '2000',
 '2002',
 '2003',
 '7 December 2016',
 'American football player',
 'Anglican priest',
 'Australian rules footballer',
 'Bible translator',
 'Brazzers',
 'Briggs',
 'Catedrático de universidad',
 'Catholic priest',
 'Christian apologetics',
 'Christian theology',
 'Digambara monk',
 'Director of Research at CNRS',
 'Dutch Wikipedia',
 'Encyclopédistes',
 'English Wikipedia',
 'Esperantist',
 'French moralists',
 'German Wikipedia',
 'Hellenist',
 'Hubrecht Institute',
 'Huygens',
 'Idist',
 'India',
 'Indologist',
 'Ingen',
 'Intelligence analysis',
 'International School for Advanced Studies',
 'Judaic scholar',
 'Justice of the Peace',
 'Latin',
 'McGill University',
 'Médico de familia',
 'Napier',
 'On the history of logarithms : Bürgi',
 'Paris Diderot University',
 'Persian Wikipedia',
 'Posek',
 'Q16267607',
 'Q19163412',
 'Q2768666

Apart from some unusual entries in the list of unique occupations (which we will ignore for the sake of time), the list contains a broad variety of occupations, which shows how versatile mathematicians are.
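
Some of the unusual entries may come from the character stripping itself: an occupation that contains an apostrophe or a comma gets mangled when we split on commas and delete quotes, which is a plausible origin for the '"childrens writer"' entry. As an alternative sketch, assuming every entry in the column is a valid Python list literal (the outputs above suggest this, but it has not been checked for the full dataset), the standard-library function <b>ast.literal_eval</b> can parse each string into a real list:

```python
import ast

# Hand-written entries in the same format as the occupation column
raw_entries = [
    "['mathematician']",
    "['publisher', 'mathematician']",
    "['mathematician', \"children's writer\"]",
]

unique = set()
for entry in raw_entries:
    # literal_eval turns the string "['a', 'b']" into the list ['a', 'b']
    unique.update(ast.literal_eval(entry))

print(sorted(unique))
```

This keeps the apostrophe in "children's writer" intact, but it will raise an error on any entry that is not a valid literal, so such entries would need to be handled separately.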
- - - - -

## Analysing gender distribution

Now let's take a look at the difference between the number of male and female mathematicians:

In [39]:
# Print out the unique entries in the "sex or gender" column
set(df["sex or gender"])

{"['female', 'Italian Wikipedia']",
 "['female']",
 "['intersex', 'female']",
 "['male', 'Integrated Authority File', '9 April 2014', 'data.bnf.fr', '10 October 2015', 'http://data.bnf.fr/ark:/12148/cb119176085']",
 "['male', 'Swedish Wikipedia', 'Virtual International Authority File', 'Italian Wikipedia']",
 "['male', 'Virtual International Authority File', 'Integrated Authority File', '9 April 2014', 'data.bnf.fr', '10 October 2015', 'http://data.bnf.fr/ark:/12148/cb118976048']",
 "['male', 'Virtual International Authority File', 'Italian Wikipedia']",
 "['male']",
 nan}

The column contains some odd entries like ['male', 'Swedish Wikipedia', 'Virtual International Authority File', 'Italian Wikipedia'], which include information that should not be in the gender column. This is definitely not the format that we want the gender column to have.

In the following tasks, we are going to modify the dataset to match what we need.

Looking closely at the unique entries in "sex or gender", each one either contains <b>male</b>, <b>female</b> or <b>intersex</b>, or is <b>NaN</b>. Some entries include other unwanted items alongside those words, but every entry falls into one of those four cases. This makes it easy to change every entry in the column so that it is either "male", "female", "intersex" or "not specified".
 
We use a method called <b>str.contains</b>, which checks whether each string entry in a Series contains a specific pattern (a word or, more generally, a combination of characters of interest; by default the pattern is interpreted as a regular expression). It outputs a boolean Series, where True represents the presence of the pattern in the entry and False its absence.

Read more about it here: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html and use it in order to create four new boolean columns in the dataframe: "male", "female", "intersex" and "not specified".
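
Before applying it to the dataframe, here is a minimal sketch of <b>str.contains</b> on a made-up Series in the same format as the "sex or gender" column. The `regex=False` and `na=False` arguments are optional choices for this example: the first treats the pattern as a plain substring, the second makes missing entries count as False instead of NaN:

```python
import numpy as np
import pandas as pd

# Made-up entries in the same format as the real column
gender_raw = pd.Series(["['male']", "['female']", "['male', 'Italian Wikipedia']", np.nan])

# The quotes are part of the pattern, so 'male' does not match inside 'female'
is_male = gender_raw.str.contains("'male'", regex=False, na=False)
is_female = gender_raw.str.contains("'female'", regex=False, na=False)

print(is_male.tolist())
print(is_female.tolist())
```

Searching for male without the surrounding quotes would also match every 'female' entry, which is why the quotes are included in the pattern.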

In [42]:
# Create the four new columns specified above in the dataframe
df['sex or gender'] = df['sex or gender'].fillna('') # This replaces the NaN values with empty strings in order to be able to apply str.contains
df['male']=df['sex or gender'].str.contains("'male'")
df['female']=df['sex or gender'].str.contains("'female'")
df['intersex']=df['sex or gender'].str.contains("'intersex'")
df['not specified'] = ~(df['male'] | df['female'] | df['intersex'])

In [45]:
# Count the number of males
male = df['male'].sum()
male

7778

In [46]:
# Count the number of females
female = df['female'].sum()
female

790

In [48]:
# Count the number of intersex
intersex = df['intersex'].sum()
intersex

1

In [50]:
# Count the number of not specified
unspecified = df['not specified'].sum()
unspecified

28

In [52]:
# Calculate the proportions of male and female mathematicians in the dataset
print('The proportions are:')
print('Male =',male*100 /(female + male + intersex + unspecified), '%')
print('Female =', female*100 /(female + male + intersex + unspecified), '%' )

The proportions are:
Male = 90.47342096080028 %
Female = 9.189252064673724 %


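Now that is a big difference between the number of male and female mathematicians. This is a clear indication that the data is heavily skewed towards male mathematicians. Note that the four counts add up to 8597 rather than the 8596 rows of the dataframe, because the entry tagged both 'intersex' and 'female' is counted in two columns; the effect on the percentages is negligible.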

Do you think that this might be a problem when using the data for predictions and drawing conclusions? Have a think about it.

We now create a new column called gender that has one of the following values: male, female, intersex or not specified.

In [54]:
# This code creates the gender column with the format that we want
gender = []
# Compute each boolean mask once, rather than once per row inside the loop
is_male = df['sex or gender'].str.contains("'male'")
is_intersex = df['sex or gender'].str.contains("'intersex'")
is_female = df['sex or gender'].str.contains("'female'")
for i in range(df.shape[0]):
    if is_male[i]:
        gender.append('male')
    elif is_intersex[i]:
        gender.append('intersex')
    elif is_female[i]:
        gender.append('female')
    else:
        gender.append('not specified')
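
The loop above can also be avoided altogether. As a sketch on a toy Series (the entries are made up, and the names `raw`, `conditions` and `labels` are chosen for this example), NumPy's <b>np.select</b> applies the same first-match-wins logic to all rows at once:

```python
import numpy as np
import pandas as pd

# Made-up entries; the empty string stands for a filled-in missing value
raw = pd.Series(["['male']", "['intersex', 'female']", "['female', 'Italian Wikipedia']", ""])

# Same order of checks as in the loop: male, then intersex, then female
conditions = [
    raw.str.contains("'male'", regex=False).to_numpy(dtype=bool),
    raw.str.contains("'intersex'", regex=False).to_numpy(dtype=bool),
    raw.str.contains("'female'", regex=False).to_numpy(dtype=bool),
]
labels = ["male", "intersex", "female"]

# np.select picks, for each row, the label of the first condition that is True
gender_vectorised = np.select(conditions, labels, default="not specified").tolist()
print(gender_vectorised)
```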

In [66]:
# Replace the 'sex or gender' column with 'gender'
df['sex or gender'] = gender

In [67]:
# Preview the 'sex or gender' column
df['sex or gender']

0         male
1       female
2         male
3         male
4         male
         ...  
8591      male
8592      male
8593      male
8594      male
8595      male
Name: sex or gender, Length: 8596, dtype: object
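
With a single cleaned column, counts and proportions can be read off directly with <b>value_counts</b>. A minimal sketch on a toy Series (the values are invented, so the percentages are not those of the dataset):

```python
import pandas as pd

# Toy column in the cleaned format
gender_col = pd.Series(["male", "male", "female", "male"])

# Absolute counts per category
counts = gender_col.value_counts()

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages
percentages = gender_col.value_counts(normalize=True) * 100

print(counts)
print(percentages)
```

Unlike the four boolean columns, each row is counted exactly once here, so the percentages always add up to 100.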

All solved! Now that we have formatted the 'sex or gender' column, we can get rid of the boolean columns 'male', 'female', 'intersex' and 'not specified'.

In [73]:
# Remove the following columns: 'male', 'female', 'intersex' and 'not specified'
df = df.drop(columns = ['male', 'female', 'intersex', 'not specified'])
df

Unnamed: 0,mathematicians,occupation,country of citizenship,place of birth,date of death,educated at,employer,place of death,member of,employer.1,...,instance of,sex or gender,approx. date of birth,day of birth,month of birth,year of birth,approx. date of death,day of death,month of death,year of death
0,Roger Joseph Boscovich,"['physicist', 'astronomer', 'mathematician', '...",['Republic of Ragusa'],"Dubrovnik, Republic of Ragusa",13 February 1787,['Pontifical Gregorian University'],['Pontifical Gregorian University'],"['Milan', 'Habsburg Empire']","['Royal Society', 'Russian Academy of Sciences...",['Pontifical Gregorian University'],...,['human'],male,False,18.0,May,1711,False,13.0,February,1787
1,Emma Previato,['mathematician'],"['United States of America', 'Italy']",Badia Polesine,,"['Harvard University', 'University of Padua']","['Boston University', 'University of Padua']",,['American Mathematical Society'],"['Boston University', 'University of Padua']",...,['human'],female,False,,,1952,False,,,
2,Feodor Deahna,['mathematician'],,,1844,,,,,,...,['human'],male,False,,,1815,False,,,1844
3,Denis Henrion,"['publisher', 'mathematician']",['France'],,1640,,,,,,...,['human'],male,True,,,1500,False,,,1640
4,Henri Delannoy,"['mathematician', 'historian', 'military perso...",['France'],Bourbonne-les-Bains,5 February 1915,['École Polytechnique'],,['Guéret'],,,...,['human'],male,False,28.0,September,1833,False,5.0,February,1915
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8591,Hugh Hamilton (bishop),['priest'],,County Dublin,1 December 1805,['Trinity College Dublin'],,['Kilkenny'],['Royal Society'],,...,['human'],male,False,26.0,March,1729,False,1.0,December,1805
8592,Boáz Klartag,"['mathematician', 'university teacher']",['Israel'],,,['Tel Aviv University'],,,,,...,['human'],male,False,25.0,April,1978,False,,,
8593,Alain M. Robert,['mathematician'],['Switzerland'],,,,,,,,...,['human'],male,False,15.0,October,1941,False,,,
8594,Paul C. Rosenbloom,['mathematician'],['United States of America'],Portsmouth,May 2005,,,,,,...,['human'],male,False,31.0,March,1920,False,,May,2005


## Analysing the dataset

In [74]:
# Are there any rows that do not have any missing values? (Output True or False)
(df.isna().sum(axis=1) == 0).any()

False

In [75]:
# Output the rows that have at least 10 missing values
df.loc[df.isna().sum(axis = 1) >= 10]

Unnamed: 0,mathematicians,occupation,country of citizenship,place of birth,date of death,educated at,employer,place of death,member of,employer.1,...,instance of,sex or gender,approx. date of birth,day of birth,month of birth,year of birth,approx. date of death,day of death,month of death,year of death
1,Emma Previato,['mathematician'],"['United States of America', 'Italy']",Badia Polesine,,"['Harvard University', 'University of Padua']","['Boston University', 'University of Padua']",,['American Mathematical Society'],"['Boston University', 'University of Padua']",...,['human'],female,False,,,1952,False,,,
2,Feodor Deahna,['mathematician'],,,1844,,,,,,...,['human'],male,False,,,1815,False,,,1844
3,Denis Henrion,"['publisher', 'mathematician']",['France'],,1640,,,,,,...,['human'],male,True,,,1500,False,,,1640
4,Henri Delannoy,"['mathematician', 'historian', 'military perso...",['France'],Bourbonne-les-Bains,5 February 1915,['École Polytechnique'],,['Guéret'],,,...,['human'],male,False,28.0,September,1833,False,5.0,February,1915
5,John Hymers,['mathematician'],,Ormesby,7 April 1887,"[""St John's College""]",,['Brandesburton'],['Royal Society'],,...,['human'],male,False,20.0,July,1803,False,7.0,April,1887
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8591,Hugh Hamilton (bishop),['priest'],,County Dublin,1 December 1805,['Trinity College Dublin'],,['Kilkenny'],['Royal Society'],,...,['human'],male,False,26.0,March,1729,False,1.0,December,1805
8592,Boáz Klartag,"['mathematician', 'university teacher']",['Israel'],,,['Tel Aviv University'],,,,,...,['human'],male,False,25.0,April,1978,False,,,
8593,Alain M. Robert,['mathematician'],['Switzerland'],,,,,,,,...,['human'],male,False,15.0,October,1941,False,,,
8594,Paul C. Rosenbloom,['mathematician'],['United States of America'],Portsmouth,May 2005,,,,,,...,['human'],male,False,31.0,March,1920,False,,May,2005


In [83]:
# Output the rows of the mathematicians that have the year of birth filled in, name the resulting dataframe x
x = df.loc[df['year of birth'].notna()]
x

Unnamed: 0,mathematicians,occupation,country of citizenship,place of birth,date of death,educated at,employer,place of death,member of,employer.1,...,instance of,sex or gender,approx. date of birth,day of birth,month of birth,year of birth,approx. date of death,day of death,month of death,year of death
0,Roger Joseph Boscovich,"['physicist', 'astronomer', 'mathematician', '...",['Republic of Ragusa'],"Dubrovnik, Republic of Ragusa",13 February 1787,['Pontifical Gregorian University'],['Pontifical Gregorian University'],"['Milan', 'Habsburg Empire']","['Royal Society', 'Russian Academy of Sciences...",['Pontifical Gregorian University'],...,['human'],male,False,18.0,May,1711,False,13.0,February,1787
1,Emma Previato,['mathematician'],"['United States of America', 'Italy']",Badia Polesine,,"['Harvard University', 'University of Padua']","['Boston University', 'University of Padua']",,['American Mathematical Society'],"['Boston University', 'University of Padua']",...,['human'],female,False,,,1952,False,,,
2,Feodor Deahna,['mathematician'],,,1844,,,,,,...,['human'],male,False,,,1815,False,,,1844
3,Denis Henrion,"['publisher', 'mathematician']",['France'],,1640,,,,,,...,['human'],male,True,,,1500,False,,,1640
4,Henri Delannoy,"['mathematician', 'historian', 'military perso...",['France'],Bourbonne-les-Bains,5 February 1915,['École Polytechnique'],,['Guéret'],,,...,['human'],male,False,28.0,September,1833,False,5.0,February,1915
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8591,Hugh Hamilton (bishop),['priest'],,County Dublin,1 December 1805,['Trinity College Dublin'],,['Kilkenny'],['Royal Society'],,...,['human'],male,False,26.0,March,1729,False,1.0,December,1805
8592,Boáz Klartag,"['mathematician', 'university teacher']",['Israel'],,,['Tel Aviv University'],,,,,...,['human'],male,False,25.0,April,1978,False,,,
8593,Alain M. Robert,['mathematician'],['Switzerland'],,,,,,,,...,['human'],male,False,15.0,October,1941,False,,,
8594,Paul C. Rosenbloom,['mathematician'],['United States of America'],Portsmouth,May 2005,,,,,,...,['human'],male,False,31.0,March,1920,False,,May,2005


The following tasks require the <b>str.contains</b> method, which is explained in the section <b>Analysing gender distribution</b> of this notebook.

In [85]:
# Check the datatype of the 'year of birth' column
x['year of birth'].dtype

dtype('O')

We would like the 'year of birth' column to have a numeric data type so that we can perform the mathematical operations needed in the following tasks.

In the next task, convert the data type of the column 'year of birth' to numeric. Do you run into a problem? Inspect the column entries to find out what is causing it and fix it. 

Note: You can either edit the entries that are causing the problems, or remove them entirely.
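
Both options in the note can be sketched on a small toy column. The values below are made up for illustration and are not taken from the dataset, and the trailing 's' pattern is only an assumption about what a problematic entry might look like. The sketch uses `errors='coerce'` in `pd.to_numeric` to locate the entries that fail to parse, then either drops or edits them.

```python
import pandas as pd

# Toy stand-in for the 'year of birth' column (made-up values, for illustration only)
years = pd.Series(['1711', '1952', '1500s', '-1680', None], name='year of birth')

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
converted = pd.to_numeric(years, errors='coerce')

# Entries that were present but failed to parse are the ones causing the problem
problem_entries = years[converted.isna() & years.notna()]
print(problem_entries.tolist())

# Option 1: remove the problematic rows, then convert
dropped = pd.to_numeric(years.drop(problem_entries.index))

# Option 2: edit the entries instead, here by stripping a trailing 's'
edited = pd.to_numeric(years.str.replace(r's$', '', regex=True))
```

Dropping is simpler, while editing keeps the row at the cost of deciding how an entry such as a decade or century should be read as a single year.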

In [86]:
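# Convert the datatype of the column 'year of birth' to numeric
# First drop the rows whose 'year of birth' entry contains the letter 's',
# since those entries cannot be parsed as numbers
x = x.drop(x.loc[x['year of birth'].str.contains('s')].index)
# Then convert the remaining entries
x['year of birth'] = pd.to_numeric(x['year of birth'])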

In [87]:
# Output the row of the earliest-born mathematician in x
x.loc[x['year of birth'] == x['year of birth'].min()]

Unnamed: 0,mathematicians,occupation,country of citizenship,place of birth,date of death,educated at,employer,place of death,member of,employer.1,...,instance of,sex or gender,approx. date of birth,day of birth,month of birth,year of birth,approx. date of death,day of death,month of death,year of death
8087,Ahmes,"['scribe', 'mathematician', 'Q19163412']",['Egypt'],Egypt,1620 BCE,,,,,,...,['human'],male,False,,,-1680,False,,,-1620


In [88]:
# Output the row of the most recently born mathematician in x
x.loc[x['year of birth'] == x['year of birth'].max()]

Unnamed: 0,mathematicians,occupation,country of citizenship,place of birth,date of death,educated at,employer,place of death,member of,employer.1,...,instance of,sex or gender,approx. date of birth,day of birth,month of birth,year of birth,approx. date of death,day of death,month of death,year of death
5704,Yasha Asley,['mathematician'],,,,,,,,,...,['human'],male,False,,,2003,False,,,


In [89]:
# Output the range of the year of birth column
x['year of birth'].max() - x['year of birth'].min()

3683
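
The minimum, maximum and range can also be obtained with `idxmin` and `idxmax`, which return the index label of the extreme value directly. The following is a minimal sketch on a toy frame: the names are hypothetical, and only the column name and the two extreme years mirror the outputs above.

```python
import pandas as pd

# Toy frame with hypothetical names; only the column name mirrors the dataset
toy = pd.DataFrame({
    'mathematicians': ['A', 'B', 'C'],
    'year of birth': [1711, -1680, 2003],
})

# idxmin/idxmax return the index label of the first minimum/maximum
earliest = toy.loc[toy['year of birth'].idxmin()]
latest = toy.loc[toy['year of birth'].idxmax()]

# Range = maximum minus minimum
year_range = latest['year of birth'] - earliest['year of birth']
print(earliest['mathematicians'], latest['mathematicians'], year_range)
```

Note that `idxmin` and `idxmax` return only the first matching row when several rows share the extreme value, whereas the boolean comparison used above returns all of them.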

In [90]:
# Output the rows of the mathematicians that have studied at the University of Cambridge
# na=False treats missing 'educated at' entries as non-matches
df.loc[df['educated at'].str.contains('Cambridge', na=False)]

Unnamed: 0,mathematicians,occupation,country of citizenship,place of birth,date of death,educated at,employer,place of death,member of,employer.1,...,instance of,sex or gender,approx. date of birth,day of birth,month of birth,year of birth,approx. date of death,day of death,month of death,year of death
49,Thomas John I'Anson Bromwich,"['mathematician', 'physicist']",['United Kingdom'],Wolverhampton,26 August 1929,"['University of Cambridge', ""St John's College""]",,['Northampton'],['Royal Society'],,...,['human'],male,False,8.0,February,1875,False,26.0,August,1929
153,Isaac Barrow,"['theologian', 'mathematician', 'historian of ...",['Kingdom of England'],London,4 May 1677,"['University of Cambridge', 'Trinity College',...",['Gresham College'],['London'],['Royal Society'],['Gresham College'],...,['human'],male,False,,October,1630,False,4.0,May,1677
284,Edward H. Simpson,"['mathematician', 'statistician']",['United Kingdom'],,,['University of Cambridge'],,,,,...,['human'],male,False,10.0,December,1922,False,,,
567,Kevin Costello,"['mathematician', 'university teacher']",,Cork,,['University of Cambridge'],,,,,...,['human'],male,False,,,1977,False,,,
676,Derek Frank Lawden,['mathematician'],['New Zealand'],Birmingham,15 February 2008,['University of Cambridge'],['University of Canterbury'],['Warwick'],,['University of Canterbury'],...,['human'],male,False,15.0,September,1919,False,15.0,February,2008
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8197,Freeman Dyson,"['mathematician', 'theoretical physicist', 'nu...","['United Kingdom', 'United States of America']",Crowthorne,,"['University of Cambridge', 'Cornell Universit...",,,"['Royal Society', 'French Academy of Sciences'...",,...,['human'],male,False,15.0,December,1923,False,,,
8200,Tamás Hausel,['mathematician'],['Hungary'],Hungary,,"['Eötvös Loránd University', 'University of Ca...","['University of Oxford', 'Institute for Advanc...",,,"['University of Oxford', 'Institute for Advanc...",...,['human'],male,False,,,1972,False,,,
8233,Seymour Papert,"['mathematician', 'computer scientist', 'educa...","['South Africa', 'United States of America']",Pretoria,31 July 2016,"['University of the Witwatersrand', '1952', 'd...","['Massachusetts Institute of Technology', '1963']",['Blue Hill'],,"['Massachusetts Institute of Technology', '1963']",...,['human'],male,False,29.0,February,1928,False,31.0,July,2016
8420,Sarah Woodhead,"['head teacher', 'teacher']",['United Kingdom'],,1912,['University of Cambridge'],['Bolton School'],,,['Bolton School'],...,['human'],female,False,,,1851,False,,,1912
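
The way `str.contains` handles missing entries matters for filters like the one above. The sketch below uses a made-up toy column (not rows from the dataset) to show the `na` and `case` parameters; the exact result of the call without `na` can depend on the pandas version and the column's dtype.

```python
import pandas as pd

# Toy column with made-up values, including a missing entry
educated = pd.Series(["['University of Cambridge']", None, "['Harvard University']"])

# Without na=..., a missing entry may come back as a missing value rather than
# True/False (this depends on the pandas version and the dtype of the column)
raw_mask = educated.str.contains('Cambridge')

# na=False treats missing entries as non-matches, giving a clean boolean mask
mask = educated.str.contains('Cambridge', na=False)
print(mask.tolist())  # [True, False, False]

# case=False makes the match case-insensitive
mask_ci = educated.str.contains('cambridge', case=False, na=False)
```

A plain substring match on 'Cambridge' also matches any other entry that happens to contain that word, so the results are worth a quick visual check.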


In [43]:
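# How many mathematicians were interested in any of the fields of Analysis?
# Count the rows whose 'field of work' mentions 'analysis' (missing entries count as non-matches)
df.loc[df['field of work'].str.contains('analysis', na=False)].shape[0]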

## Extra task

The owner of this dataset has published their own investigation of it, which you can explore here: https://www.kaggle.com/joephilleo/investigating-the-mathematicians-of-wikipedia