# Pandas 

## Plan

1. read in a dataset
2. get some immediate information about the data
3. subset our data
    1. by column
    1. by row (filtering)
    1. get min and max value locations (`.idxmin()`, `.idxmax()`)
4. sort our data
5. make changes to our data
    1. new column names
    1. change column values
    1. create new columns
6. calculate summary stats
    1. for the whole dataset
    1. for groups
7. deal with missing values

In [1]:
# import the libraries used in this lesson (the data is read in the next cell)
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('data/life_expectancy_and_income_missing.csv')

In [3]:
# getting some immediate information about the data
df.columns

Index(['country', 'year', 'fertility_rate', 'income_per_person',
       'life_expectancy'],
      dtype='object')

In [4]:
df.shape

(22080, 5)

In [5]:
df.head()

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
0,Afghanistan,1900,7.0,1090.0,29.4
1,Afghanistan,1901,7.0,1110.0,
2,Afghanistan,1902,7.0,1120.0,29.5
3,Afghanistan,1903,7.0,1140.0,29.6
4,Afghanistan,1904,7.0,1160.0,29.7


In [6]:
df.describe()
# summarises the distribution of each column
# by default only numeric columns are included

Unnamed: 0,year,fertility_rate,income_per_person,life_expectancy
count,22080.0,22080.0,22066.0,21896.0
mean,1959.5,4.84077,7608.560274,52.725449
std,34.640598,1.916428,13450.013474,16.743644
min,1900.0,1.12,312.0,1.1
25%,1929.75,2.98,1370.0,35.8
50%,1959.5,5.45,2880.0,53.8
75%,1989.25,6.5,7710.0,68.3
max,2019.0,8.87,179000.0,85.1


In [7]:
df.describe(include = 'all')
# include = 'all' also summarises non-numeric columns (unique, top, freq)

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
count,22080,22080.0,22080.0,22066.0,21896.0
unique,184,,,,
top,Afghanistan,,,,
freq,120,,,,
mean,,1959.5,4.84077,7608.560274,52.725449
std,,34.640598,1.916428,13450.013474,16.743644
min,,1900.0,1.12,312.0,1.1
25%,,1929.75,2.98,1370.0,35.8
50%,,1959.5,5.45,2880.0,53.8
75%,,1989.25,6.5,7710.0,68.3


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22080 entries, 0 to 22079
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   country            22080 non-null  object 
 1   year               22080 non-null  int64  
 2   fertility_rate     22080 non-null  float64
 3   income_per_person  22066 non-null  float64
 4   life_expectancy    21896 non-null  float64
dtypes: float64(3), int64(1), object(1)
memory usage: 862.6+ KB


In [11]:
df.info(verbose = False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22080 entries, 0 to 22079
Columns: 5 entries, country to life_expectancy
dtypes: float64(3), int64(1), object(1)
memory usage: 862.6+ KB
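The non-null counts reported by `df.info()` and the `count` row of `df.describe()` already hint at missing values: any column whose count is below the number of rows has gaps. A minimal sketch of that check, using a small made-up DataFrame (the values are illustrative, not taken from the lesson's dataset):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1900, 1901, 1900, 1901],
    "life_expectancy": [29.4, np.nan, 35.4, 35.6],
})

n_rows, n_cols = toy.shape      # (number of rows, number of columns)
non_null = toy.count()          # non-null values per column, as shown by info()
n_missing = n_rows - non_null   # missing values per column

print(n_missing)
```

The result should match `toy.isna().sum()`, which is the more direct way to count missing values per column.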


Exercise
For the exercises in this lesson we'll use a dataset about occupational prestige. Here is some information on the variables contained in the data.

`education`: Average years of education of occupational incumbents, years, in 1971.
`income`: Average income of incumbents, dollars, in 1971.
`women`: Percentage of incumbents who are women.
`prestige`: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
`census`: Canadian Census occupational code.
`type`: Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar.

Read in the file prestige_occupation.csv. Save it as an object called occupation_prestige.
How many rows and columns are in the data?
What is the average (mean) value for the column prestige? (Hint: you can use describe to answer this.)

In [12]:
df_op = pd.read_csv('data/prestige_occupation.csv')

In [13]:
df_op.shape

(102, 7)

In [19]:
df_op["prestige"].describe()

count    102.000000
mean      46.833333
std       17.204486
min       14.800000
25%       35.225000
50%       43.600000
75%       59.275000
max       87.200000
Name: prestige, dtype: float64

In [20]:
# subsetting data: attribute access selects a single column
df.country

0        Afghanistan
1        Afghanistan
2        Afghanistan
3        Afghanistan
4        Afghanistan
            ...     
22075       Zimbabwe
22076       Zimbabwe
22077       Zimbabwe
22078       Zimbabwe
22079       Zimbabwe
Name: country, Length: 22080, dtype: object

In [21]:
df['country']
# a single column name returns a Series

0        Afghanistan
1        Afghanistan
2        Afghanistan
3        Afghanistan
4        Afghanistan
            ...     
22075       Zimbabwe
22076       Zimbabwe
22077       Zimbabwe
22078       Zimbabwe
22079       Zimbabwe
Name: country, Length: 22080, dtype: object

In [22]:
df[['year', 'country']]
# a list of column names returns a DataFrame, in the order given

Unnamed: 0,year,country
0,1900,Afghanistan
1,1901,Afghanistan
2,1902,Afghanistan
3,1903,Afghanistan
4,1904,Afghanistan
...,...,...
22075,2015,Zimbabwe
22076,2016,Zimbabwe
22077,2017,Zimbabwe
22078,2018,Zimbabwe


In [23]:
df.loc[:, ['country', 'year']]

Unnamed: 0,country,year
0,Afghanistan,1900
1,Afghanistan,1901
2,Afghanistan,1902
3,Afghanistan,1903
4,Afghanistan,1904
...,...,...
22075,Zimbabwe,2015
22076,Zimbabwe,2016
22077,Zimbabwe,2017
22078,Zimbabwe,2018
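The difference between selecting with a single name and selecting with a list of names can be checked directly. The sketch below uses a tiny made-up DataFrame (placeholder values, not the lesson's data) to show which type each form returns:

```python
import pandas as pd

# Toy data with placeholder values, for illustration only
toy = pd.DataFrame({
    "country": ["Japan", "Italy"],
    "year": [2001, 2001],
    "life_expectancy": [1.0, 2.0],
})

col_as_series = toy["country"]               # single name -> Series
col_as_frame = toy[["country"]]              # list with one name -> DataFrame
two_cols = toy.loc[:, ["year", "country"]]   # all rows, columns in the listed order

print(type(col_as_series).__name__, type(col_as_frame).__name__)
print(list(two_cols.columns))
```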


In [24]:
mask = df.country == 'Japan'
df.loc[mask]

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
9720,Japan,1900,4.69,1860.0,38.7
9721,Japan,1901,5.01,1900.0,
9722,Japan,1902,4.97,1780.0,39.0
9723,Japan,1903,4.83,1880.0,39.1
9724,Japan,1904,4.61,1870.0,39.2
...,...,...,...,...,...
9835,Japan,2015,1.44,37800.0,84.1
9836,Japan,2016,1.46,38100.0,84.2
9837,Japan,2017,1.47,38900.0,84.2
9838,Japan,2018,1.48,39300.0,84.4


In [25]:
df.loc[mask, ['country', 'year', 'life_expectancy']]

Unnamed: 0,country,year,life_expectancy
9720,Japan,1900,38.7
9721,Japan,1901,
9722,Japan,1902,39.0
9723,Japan,1903,39.1
9724,Japan,1904,39.2
...,...,...,...
9835,Japan,2015,84.1
9836,Japan,2016,84.2
9837,Japan,2017,84.2
9838,Japan,2018,84.4


In [27]:
# mask is condition 1 AND condition 2: combine with & and wrap each condition in parentheses
mask = (df.country == 'Japan') & (df.year > 2000)

In [29]:
df.loc[mask]

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
9821,Japan,2001,1.31,33900.0,81.7
9822,Japan,2002,1.3,33900.0,82.0
9823,Japan,2003,1.3,34300.0,82.1
9824,Japan,2004,1.3,35100.0,82.3
9825,Japan,2005,1.31,35700.0,82.3
9826,Japan,2006,1.32,36100.0,82.6
9827,Japan,2007,1.33,36700.0,82.8
9828,Japan,2008,1.34,36300.0,82.9
9829,Japan,2009,1.36,34300.0,83.1
9830,Japan,2010,1.37,35800.0,83.1


In [32]:
# Python's `or` does not work element-wise on a Series; use | or .isin() instead
# mask = (df.country == 'Japan') | (df.country == 'Italy')
mask_1 = df.country.isin(['Japan','Italy'])
df.loc[mask_1]

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
9480,Italy,1900,4.53,3780.0,41.7
9481,Italy,1901,4.49,3840.0,
9482,Italy,1902,4.46,3900.0,43.0
9483,Italy,1903,4.43,3940.0,43.1
9484,Italy,1904,4.44,4010.0,44.4
...,...,...,...,...,...
9835,Japan,2015,1.44,37800.0,84.1
9836,Japan,2016,1.46,38100.0,84.2
9837,Japan,2017,1.47,38900.0,84.2
9838,Japan,2018,1.48,39300.0,84.4
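For "either/or" filters, the element-wise operator is `|`, and `.isin()` is a shorthand for several `==` comparisons on the same column. A short sketch on made-up rows, showing that the two masks agree and that plain `or` raises an error on a Series:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["Japan", "Italy", "Samoa", "Japan"],
    "year": [2000, 2000, 2000, 2001],
})

# Element-wise OR: each comparison needs its own parentheses
mask_or = (toy.country == "Japan") | (toy.country == "Italy")

# Equivalent shorthand for "is one of these values"
mask_isin = toy.country.isin(["Japan", "Italy"])

# ~ negates a mask: every row that is NOT Japan or Italy
mask_not = ~mask_isin

# Plain `or` asks for a single True/False from a whole Series, which is ambiguous
try:
    (toy.country == "Japan") or (toy.country == "Italy")
    or_worked = True
except ValueError:
    or_worked = False

print(toy.loc[mask_or])
print(toy.loc[mask_not])
```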


In [34]:
# min and max row locations
# saving subsets
i = df.life_expectancy.idxmin()
i

16338

In [36]:
df.loc[i]

country               Samoa
year                   1918
fertility_rate         6.98
income_per_person    2050.0
life_expectancy         1.1
Name: 16338, dtype: object

In [38]:
df.loc[df.life_expectancy.idxmax()]

country              Singapore
year                      2019
fertility_rate            1.27
income_per_person      90100.0
life_expectancy           85.1
Name: 17279, dtype: object

In [39]:
# saving: plain assignment does not copy a list, both names refer to the same object
my_list = [0,1,2,3,4]

In [40]:
my_list_view = my_list
my_list_view.append(10)
my_list

[0, 1, 2, 3, 4, 10]

In [41]:
my_list_copy = my_list.copy()
my_list_copy.remove(10)
print(my_list)
print(my_list_copy)

[0, 1, 2, 3, 4, 10]
[0, 1, 2, 3, 4]


In [43]:
mask = (df.country == 'Japan') & (df.year > 2000)
jp_df = df.loc[mask, ['country', 'year', 'life_expectancy']].copy()

In [44]:
jp_df

Unnamed: 0,country,year,life_expectancy
9821,Japan,2001,81.7
9822,Japan,2002,82.0
9823,Japan,2003,82.1
9824,Japan,2004,82.3
9825,Japan,2005,82.3
9826,Japan,2006,82.6
9827,Japan,2007,82.8
9828,Japan,2008,82.9
9829,Japan,2009,83.1
9830,Japan,2010,83.1
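The list example carries over to DataFrames: calling `.copy()` on a subset gives an independent object, so later changes to the subset cannot touch the original. A minimal sketch with made-up values (the numbers are placeholders):

```python
import pandas as pd

# Toy data with placeholder values, for illustration only
toy = pd.DataFrame({
    "country": ["Japan", "Japan", "Italy"],
    "year": [2001, 2002, 2001],
    "life_expectancy": [1.0, 2.0, 3.0],
})

mask = toy.country == "Japan"

# .copy() makes the subset independent of toy
subset = toy.loc[mask, ["country", "life_expectancy"]].copy()

# Overwrite a column in the subset only
subset["life_expectancy"] = 0.0

print(toy)      # original values are unchanged
print(subset)   # only the copy was modified
```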


Make a new data frame called job_incomes that has just the "job", "type" and "income" columns from occupation_prestige. Make sure the columns appear in the order given.

In [45]:
# save the subset under the requested name, then display it
job_incomes = df_op[["job", "type", "income"]].copy()
job_incomes

Unnamed: 0,job,type,income
0,gov.administrators,prof,12351
1,general.managers,prof,25879
2,accountants,prof,9271
3,purchasing.officers,prof,8865
4,chemists,prof,8403
...,...,...,...
97,bus.drivers,bc,5562
98,taxi.drivers,bc,4224
99,longshoremen,bc,4753
100,typesetters,bc,6462



Task - 5 minutes
Return the two rows with the highest and lowest income at the same time.
[Hint: you can pass a list of index labels to `.loc[]`]

In [51]:
mk = (df_op.income == df_op.income.max()) | (df_op.income == df_op.income.min())
df_op.loc[mk]

Unnamed: 0,job,education,income,women,prestige,census,type
1,general.managers,12.26,25879,4.02,69.1,1130,prof
62,babysitters,9.46,611,96.53,25.9,6147,


In [50]:
df_op.loc[[df_op.income.idxmax(), df_op.income.idxmin()]]

Unnamed: 0,job,education,income,women,prestige,census,type
1,general.managers,12.26,25879,4.02,69.1,1130,prof
62,babysitters,9.46,611,96.53,25.9,6147,
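One detail worth keeping in mind: `idxmax()` and `idxmin()` return index labels, not row positions, which is why they pair with `.loc` rather than `.iloc`. The sketch below uses a made-up frame with a non-default index to make the distinction visible:

```python
import pandas as pd

# Toy data with a deliberately non-default index, for illustration only
toy = pd.DataFrame(
    {"job": ["a", "b", "c"], "income": [500, 900, 100]},
    index=[10, 20, 30],
)

hi_label = toy.income.idxmax()   # label of the row with the largest income
lo_label = toy.income.idxmin()   # label of the row with the smallest income

# .loc accepts a list of labels and returns the rows in that order
extremes = toy.loc[[hi_label, lo_label]]

print(hi_label, lo_label)
print(extremes)
```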


In [52]:
#sorting 
df.sort_values('year')

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
0,Afghanistan,1900,7.00,1090.0,29.4
3360,Cameroon,1900,5.54,924.0,29.7
480,Antigua and Barbuda,1900,4.63,1300.0,33.8
11280,Lithuania,1900,4.96,2740.0,41.7
11160,Libya,1900,7.20,2470.0,34.7
...,...,...,...,...,...
14159,Niger,2019,7.07,954.0,63.2
14039,Nicaragua,2019,2.12,4620.0,79.2
13919,New Zealand,2019,1.96,36500.0,81.9
13679,Nepal,2019,2.02,2880.0,71.5


In [53]:
df.sort_values(['year','country'])

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
0,Afghanistan,1900,7.00,1090.0,29.4
120,Albania,1900,4.60,1220.0,35.4
240,Algeria,1900,6.99,1750.0,30.2
360,Angola,1900,7.00,958.0,29.0
480,Antigua and Barbuda,1900,4.63,1300.0,33.8
...,...,...,...,...,...
21599,Venezuela,2019,2.25,9720.0,75.1
21719,Vietnam,2019,1.94,6970.0,74.7
21839,Yemen,2019,3.69,2340.0,68.1
21959,Zambia,2019,4.81,3700.0,64.0


Sort the demo data by life expectancy in descending order.

In [54]:
df.sort_values('life_expectancy', ascending = False)

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
17279,Singapore,2019,1.27,90100.0,85.1
17278,Singapore,2018,1.26,90100.0,85.0
17277,Singapore,2017,1.25,87800.0,84.8
17276,Singapore,2016,1.25,84700.0,84.7
9839,Japan,2019,1.50,39700.0,84.5
...,...,...,...,...,...
21481,Venezuela,1901,5.64,2000.0,
21601,Vietnam,1901,4.66,969.0,
21721,Yemen,1901,6.88,1170.0,
21841,Zambia,1901,6.71,847.0,


In [55]:
df.sort_values(['year','country'], inplace = True)
df

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
0,Afghanistan,1900,7.00,1090.0,29.4
120,Albania,1900,4.60,1220.0,35.4
240,Algeria,1900,6.99,1750.0,30.2
360,Angola,1900,7.00,958.0,29.0
480,Antigua and Barbuda,1900,4.63,1300.0,33.8
...,...,...,...,...,...
21599,Venezuela,2019,2.25,9720.0,75.1
21719,Vietnam,2019,1.94,6970.0,74.7
21839,Yemen,2019,3.69,2340.0,68.1
21959,Zambia,2019,4.81,3700.0,64.0
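Sorting keeps the original index labels attached to each row, which is why the index in the output above no longer counts up from 0. A small sketch on made-up rows, showing the non-inplace form, an optional `reset_index(drop=True)`, and a per-column sort direction:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["B", "A", "B", "A"],
    "year": [1901, 1901, 1900, 1900],
})

# Without inplace=True, sort_values returns a new DataFrame and leaves toy as it was
sorted_toy = toy.sort_values(["year", "country"])

# Optional: replace the shuffled labels with a fresh 0..n-1 index
sorted_clean = sorted_toy.reset_index(drop=True)

# A list for `ascending` sets the direction per column: year descending, country ascending
mixed = toy.sort_values(["year", "country"], ascending=[False, True])

print(sorted_toy)
print(mixed)
```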


In [None]:
# general pattern with placeholder names: map old column names to new ones
df.rename(columns = {'old': 'new'})

In [57]:
df.rename(columns= {'income_per_person':'GDP'}, inplace = True)
df

Unnamed: 0,country,year,fertility_rate,GDP,life_expectancy
0,Afghanistan,1900,7.00,1090.0,29.4
120,Albania,1900,4.60,1220.0,35.4
240,Algeria,1900,6.99,1750.0,30.2
360,Angola,1900,7.00,958.0,29.0
480,Antigua and Barbuda,1900,4.63,1300.0,33.8
...,...,...,...,...,...
21599,Venezuela,2019,2.25,9720.0,75.1
21719,Vietnam,2019,1.94,6970.0,74.7
21839,Yemen,2019,3.69,2340.0,68.1
21959,Zambia,2019,4.81,3700.0,64.0


In [58]:
df.rename(columns = {'country':"place", 'year':'time'})

Unnamed: 0,place,time,fertility_rate,GDP,life_expectancy
0,Afghanistan,1900,7.00,1090.0,29.4
120,Albania,1900,4.60,1220.0,35.4
240,Algeria,1900,6.99,1750.0,30.2
360,Angola,1900,7.00,958.0,29.0
480,Antigua and Barbuda,1900,4.63,1300.0,33.8
...,...,...,...,...,...
21599,Venezuela,2019,2.25,9720.0,75.1
21719,Vietnam,2019,1.94,6970.0,74.7
21839,Yemen,2019,3.69,2340.0,68.1
21959,Zambia,2019,4.81,3700.0,64.0
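Like `sort_values`, `rename` returns a new DataFrame unless `inplace = True` is used or the result is assigned back, which is why `place` and `time` do not persist after the cell above. A minimal sketch on a made-up one-row frame:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"country": ["A"], "income_per_person": [1090.0]})

# The result has the new name; toy itself keeps the old one
renamed = toy.rename(columns={"income_per_person": "GDP"})

# Keys that match no column are ignored by default, so a typo fails silently
unchanged = toy.rename(columns={"old": "new"})

print(list(toy.columns))
print(list(renamed.columns))
```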


In [61]:
# changing the values of a column
df.fertility_rate = df.fertility_rate * 4 
df

Unnamed: 0,country,year,fertility_rate,GDP,life_expectancy
0,Afghanistan,1900,112.00,1090.0,29.4
120,Albania,1900,73.60,1220.0,35.4
240,Algeria,1900,111.84,1750.0,30.2
360,Angola,1900,112.00,958.0,29.0
480,Antigua and Barbuda,1900,74.08,1300.0,33.8
...,...,...,...,...,...
21599,Venezuela,2019,36.00,9720.0,75.1
21719,Vietnam,2019,31.04,6970.0,74.7
21839,Yemen,2019,59.04,2340.0,68.1
21959,Zambia,2019,76.96,3700.0,64.0


In [63]:
df["f_rate"] = df.fertility_rate / 4 
df

Unnamed: 0,country,year,fertility_rate,GDP,life_expectancy,f_rate
0,Afghanistan,1900,112.00,1090.0,29.4,28.00
120,Albania,1900,73.60,1220.0,35.4,18.40
240,Algeria,1900,111.84,1750.0,30.2,27.96
360,Angola,1900,112.00,958.0,29.0,28.00
480,Antigua and Barbuda,1900,74.08,1300.0,33.8,18.52
...,...,...,...,...,...,...
21599,Venezuela,2019,36.00,9720.0,75.1,9.00
21719,Vietnam,2019,31.04,6970.0,74.7,7.76
21839,Yemen,2019,59.04,2340.0,68.1,14.76
21959,Zambia,2019,76.96,3700.0,64.0,19.24


In [None]:
# creating new columns

In [66]:
df.assign(years_since_1900 = df.year - 1900)
# assign() returns a new DataFrame; it doesn't change the original df
df_1900 = (
    df
    .assign(years_since_1900 = df.year - 1900,          # create a new column
            fertility_rate = round(df.fertility_rate))  # modify an existing column
    .copy()
)
df_1900

Unnamed: 0,country,year,fertility_rate,GDP,life_expectancy,f_rate,years_since_1900
0,Afghanistan,1900,112.0,1090.0,29.4,28.00,0
120,Albania,1900,74.0,1220.0,35.4,18.40,0
240,Algeria,1900,112.0,1750.0,30.2,27.96,0
360,Angola,1900,112.0,958.0,29.0,28.00,0
480,Antigua and Barbuda,1900,74.0,1300.0,33.8,18.52,0
...,...,...,...,...,...,...,...
21599,Venezuela,2019,36.0,9720.0,75.1,9.00,119
21719,Vietnam,2019,31.0,6970.0,74.7,7.76,119
21839,Yemen,2019,59.0,2340.0,68.1,14.76,119
21959,Zambia,2019,77.0,3700.0,64.0,19.24,119
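Inside a method chain, `assign` can also take a function (usually a `lambda`) that receives the DataFrame as it exists at that step. This avoids referring back to the original variable and lets one new column build on another. A sketch with made-up numbers; the column `decades_since_1900` is a hypothetical extra added only for illustration:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "year": [1900, 1950, 2019],
    "fertility_rate": [7.04, 6.51, 4.49],
})

result = (
    toy
    # create a column from an existing one
    .assign(years_since_1900=lambda d: d.year - 1900)
    # build on the column created in the previous step
    .assign(decades_since_1900=lambda d: d.years_since_1900 // 10)
    # reusing an existing name overwrites that column in the result
    .assign(fertility_rate=lambda d: d.fertility_rate.round())
)

print(result)
print(toy)   # the original is untouched
```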


Change the "income" column in prestige to be in 1000s of dollars, rather than dollars.

In [71]:
df_op_income = df_op.assign(income = df_op.income/1000).copy()
df_op_income

Unnamed: 0,job,education,income,women,prestige,census,type
0,gov.administrators,13.11,12.351,11.16,68.8,1113,prof
1,general.managers,12.26,25.879,4.02,69.1,1130,prof
2,accountants,12.77,9.271,15.70,63.4,1171,prof
3,purchasing.officers,11.42,8.865,9.11,56.8,1175,prof
4,chemists,14.62,8.403,11.68,73.5,2111,prof
...,...,...,...,...,...,...,...
97,bus.drivers,7.58,5.562,9.47,35.9,9171,bc
98,taxi.drivers,7.93,4.224,3.59,25.1,9173,bc
99,longshoremen,8.37,4.753,,26.1,9313,bc
100,typesetters,10.00,6.462,13.58,42.2,9511,bc


In [None]:
# summarising the data

In [72]:
df.describe()
# count, mean, std, min, quartiles and max for each numeric column

Unnamed: 0,year,fertility_rate,GDP,life_expectancy,f_rate
count,22080.0,22080.0,22066.0,21896.0,22080.0
mean,1959.5,77.452326,7608.560274,52.725449,19.363082
std,34.640598,30.66285,13450.013474,16.743644,7.665713
min,1900.0,17.92,312.0,1.1,4.48
25%,1929.75,47.68,1370.0,35.8,11.92
50%,1959.5,87.2,2880.0,53.8,21.8
75%,1989.25,104.0,7710.0,68.3,26.0
max,2019.0,141.92,179000.0,85.1,35.48


In [73]:
df.fertility_rate.min()

17.92

In [74]:
mask = df.life_expectancy < df.life_expectancy.median()
df.loc[mask]

Unnamed: 0,country,year,fertility_rate,GDP,life_expectancy,f_rate
0,Afghanistan,1900,112.00,1090.0,29.4,28.00
120,Albania,1900,73.60,1220.0,35.4,18.40
240,Algeria,1900,111.84,1750.0,30.2,27.96
360,Angola,1900,112.00,958.0,29.0,28.00
480,Antigua and Barbuda,1900,74.08,1300.0,33.8,18.52
...,...,...,...,...,...,...
3836,Central African Republic,2016,77.92,731.0,51.7,19.48
11036,Lesotho,2016,49.44,2940.0,52.5,12.36
3837,Central African Republic,2017,76.80,754.0,51.9,19.20
3838,Central African Republic,2018,75.52,775.0,52.4,18.88


In [78]:
# group by country 
# get max life_exp for each group 
df.groupby('country').life_expectancy.max().reset_index(name = "max_life_exp")

Unnamed: 0,country,max_life_exp
0,Afghanistan,64.1
1,Albania,78.5
2,Algeria,78.1
3,Angola,65.0
4,Antigua and Barbuda,77.3
...,...,...
179,Venezuela,75.3
180,Vietnam,74.7
181,Yemen,69.0
182,Zambia,64.0


For each question here, return a pandas series.
Find the maximum income for each type in the occupation_prestige data.
Find the average prestige for each type.
Find the lowest percentage of women for each type.

In [79]:
df_op.groupby('type').income.max()

type
bc       8895
prof    25879
wc       8780
Name: income, dtype: int64

In [80]:
df_op.groupby('type').prestige.mean()

type
bc      35.527273
prof    67.848387
wc      42.243478
Name: prestige, dtype: float64

In [82]:
df_op.groupby('type').women.min()

type
bc      0.52
prof    0.58
wc      3.16
Name: women, dtype: float64

In [84]:
df.isna().sum()

country              0
year                 0
fertility_rate       0
GDP                 14
life_expectancy    184
f_rate               0
dtype: int64

In [86]:
# missing values: dropna() drops every row that contains at least one NaN
df.dropna().shape

(21882, 6)

In [87]:
df.dropna(subset = ['life_expectancy']).isna().sum()

country             0
year                0
fertility_rate      0
GDP                14
life_expectancy     0
f_rate              0
dtype: int64

In [88]:
# imputation: fill missing values with a replacement value
df.fillna(
    value = 0)

Unnamed: 0,country,year,fertility_rate,GDP,life_expectancy,f_rate
0,Afghanistan,1900,112.00,1090.0,29.4,28.00
120,Albania,1900,73.60,1220.0,35.4,18.40
240,Algeria,1900,111.84,1750.0,30.2,27.96
360,Angola,1900,112.00,958.0,29.0,28.00
480,Antigua and Barbuda,1900,74.08,1300.0,33.8,18.52
...,...,...,...,...,...,...
21599,Venezuela,2019,36.00,9720.0,75.1,9.00
21719,Vietnam,2019,31.04,6970.0,74.7,7.76
21839,Yemen,2019,59.04,2340.0,68.1,14.76
21959,Zambia,2019,76.96,3700.0,64.0,19.24


In [89]:
df.fillna(value = {
    'life_expectancy': 0,
    'GDP': 50})   # income_per_person was renamed to GDP above

Unnamed: 0,country,year,fertility_rate,GDP,life_expectancy,f_rate
0,Afghanistan,1900,112.00,1090.0,29.4,28.00
120,Albania,1900,73.60,1220.0,35.4,18.40
240,Algeria,1900,111.84,1750.0,30.2,27.96
360,Angola,1900,112.00,958.0,29.0,28.00
480,Antigua and Barbuda,1900,74.08,1300.0,33.8,18.52
...,...,...,...,...,...,...
21599,Venezuela,2019,36.00,9720.0,75.1,9.00
21719,Vietnam,2019,31.04,6970.0,74.7,7.76
21839,Yemen,2019,59.04,2340.0,68.1,14.76
21959,Zambia,2019,76.96,3700.0,64.0,19.24


In [91]:
df.fillna(value = {
    'life_expectancy': df.life_expectancy.mean(),
    'GDP': df.GDP.median()   # income_per_person was renamed to GDP above
})

Unnamed: 0,country,year,fertility_rate,GDP,life_expectancy,f_rate
0,Afghanistan,1900,112.00,1090.0,29.4,28.00
120,Albania,1900,73.60,1220.0,35.4,18.40
240,Algeria,1900,111.84,1750.0,30.2,27.96
360,Angola,1900,112.00,958.0,29.0,28.00
480,Antigua and Barbuda,1900,74.08,1300.0,33.8,18.52
...,...,...,...,...,...,...
21599,Venezuela,2019,36.00,9720.0,75.1,9.00
21719,Vietnam,2019,31.04,6970.0,74.7,7.76
21839,Yemen,2019,59.04,2340.0,68.1,14.76
21959,Zambia,2019,76.96,3700.0,64.0,19.24


In [None]:
# groupby works here too: fill with the mean of each group

In [93]:
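# mean life expectancy of each row's country, aligned to the original index
mean_per_country = df.groupby('country').life_expectancy.transform('mean')

df.fillna(value = {
    'life_expectancy': mean_per_country,
    'GDP': df.GDP.median()   # income_per_person was renamed to GDP above
})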

Unnamed: 0,country,year,fertility_rate,GDP,life_expectancy,f_rate
0,Afghanistan,1900,112.00,1090.0,29.4,28.00
120,Albania,1900,73.60,1220.0,35.4,18.40
240,Algeria,1900,111.84,1750.0,30.2,27.96
360,Angola,1900,112.00,958.0,29.0,28.00
480,Antigua and Barbuda,1900,74.08,1300.0,33.8,18.52
...,...,...,...,...,...,...
21599,Venezuela,2019,36.00,9720.0,75.1,9.00
21719,Vietnam,2019,31.04,6970.0,74.7,7.76
21839,Yemen,2019,59.04,2340.0,68.1,14.76
21959,Zambia,2019,76.96,3700.0,64.0,19.24
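Because the displayed rows contain no missing values, the effect of group-wise imputation is hard to see in the output above. The sketch below repeats the idea on a tiny made-up frame where the filled values can be checked by hand: `transform('mean')` returns one value per row (the mean of that row's group), so it lines up with the column being filled.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "life_expectancy": [30.0, np.nan, 40.0, 70.0, np.nan],
})

# One value per row: the mean of that row's country (NaNs are skipped)
group_mean = toy.groupby("country").life_expectancy.transform("mean")

# Fill each gap with the mean of its own group
filled = toy.assign(life_expectancy=toy.life_expectancy.fillna(group_mean))

print(group_mean.tolist())
print(filled)
```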


1. Find which columns have missing values in occupation_prestige.
2. Replace all the missing values in type with "other". Check that this change has been made in occupation_prestige.
3. Find the average of women:
    1. without changing or removing missing values
    1. with all the missing values changed to 0
    1. with all the missing values dropped

In [94]:
df_op.isna().sum()

job          0
education    0
income       0
women        5
prestige     0
census       0
type         4
dtype: int64

In [95]:
df_op.fillna(value = {
    'type': "other"
}
)

Unnamed: 0,job,education,income,women,prestige,census,type
0,gov.administrators,13.11,12351,11.16,68.8,1113,prof
1,general.managers,12.26,25879,4.02,69.1,1130,prof
2,accountants,12.77,9271,15.70,63.4,1171,prof
3,purchasing.officers,11.42,8865,9.11,56.8,1175,prof
4,chemists,14.62,8403,11.68,73.5,2111,prof
...,...,...,...,...,...,...,...
97,bus.drivers,7.58,5562,9.47,35.9,9171,bc
98,taxi.drivers,7.93,4224,3.59,25.1,9173,bc
99,longshoremen,8.37,4753,,26.1,9313,bc
100,typesetters,10.00,6462,13.58,42.2,9511,bc


In [96]:
df_op.women.mean()

30.47278350515466

In [97]:
df_op.fillna(value = {
    'women': 0
}
).women.mean()

28.979019607843156

In [99]:
df_op.dropna(subset = ['women']).women.mean()

30.47278350515466
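The first and third answers match because `mean()` skips missing values by default, so dropping them first changes nothing, while filling with 0 adds extra zeros to the average and pulls it down. A small sketch with made-up numbers that can be verified by hand:

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration only
women = pd.Series([10.0, np.nan, 20.0])

mean_default = women.mean()               # NaN is skipped: (10 + 20) / 2
mean_zero = women.fillna(0).mean()        # NaN counted as 0: (10 + 0 + 20) / 3
mean_dropped = women.dropna().mean()      # same as the default
mean_strict = women.mean(skipna=False)    # NaN propagates when skipping is turned off

print(mean_default, mean_zero, mean_dropped, mean_strict)
```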

In [101]:
(df_op.fillna(value = {
    'women': 0
}
).loc[:,['job', 'income', 'women', 'type']]
.assign(income_1000 = df_op.income/1000)
.groupby('type')
.income_1000
.mean()
)

type
bc       5.374136
prof    10.559452
wc       5.052304
Name: income_1000, dtype: float64