
# Describing data with summary statistics
---

This worksheet follows the pandas Getting Started tutorials, picking out particular tutorials and linking them into a single theme.

We will focus on describing data. Description is the least risky kind of analysis in terms of bias and inaccurate conclusions, because it reports only on the data presented to us.

Each exercise asks you to work through one tutorial on the Getting Started page, try the code from the tutorial here, and then try a second, similar action.

---

The practice data from the tutorials comes from a dataset on Titanic passengers.


### Exercise 1 - open the Titanic dataset
---

The Titanic dataset is stored at this URL:
https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv

Read the dataset into a pandas dataframe that you will call **titanic**.

**Test output**:  
The shape of the dataframe will be (891, 12)
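
As a rough illustration of what `read_csv` and `shape` do, here is a minimal sketch that reads a tiny inline CSV instead of the Titanic URL. The three rows and the name `toy` are made up for the example and are not part of the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for a real file or URL.
csv_text = "name,age,fare\nAnn,30,7.25\nBob,,8.05\nCy,41,71.28\n"

# read_csv accepts a path, a URL, or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```

The empty `age` field for the second row is read as a missing value (NaN), which is how gaps in the Titanic data also appear.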

In [1]:
import pandas as pd

url = 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv'

titanic = pd.read_csv(url)
display(titanic.head(3), titanic.shape)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


(891, 12)

### Exercise 2 - get summary information about the dataframe
---

Read through the tutorials:  
[What kind of data does pandas handle?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html#)  
[How do I read and write tabular data?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html)

Use pandas functions to display the following:
1.  A technical summary of the data (info())
2.  A description of the numerical data (describe())
3.  Display the Series 'Age'

**Test output**:   
1.  The info should show that there are only 204 values in the Cabin series, out of 891 records.  
2.  The description should show 7 columns and a mean age of 29.699118
3.  The Age series should have values of type float64 and Length 891
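
The difference between a Series' length and its number of non-missing values matters for the checks above. The following is a small sketch on an invented four-value Series (not the Titanic data), assuming only standard pandas and NumPy calls.

```python
import numpy as np
import pandas as pd

# Illustrative ages with one missing value.
ages = pd.Series([22.0, np.nan, 26.0, 35.0], name="Age")

total_len = ages.size        # counts every entry, missing or not
non_missing = ages.count()   # counts only non-null entries
print(total_len, non_missing)

# describe() skips missing values when computing its statistics.
summary = ages.describe()
print(summary["count"], summary["mean"])
```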

In [3]:
print('A technical summary of the data: ')
display(titanic.info())
print('\n')

print('A description with the main statistics of the numerical data: ')
display(titanic.describe())
print('\n')

print('the first lines of the series \'Age\' and its whole length: ')
display(titanic.Age[:5], titanic.Age.size)  #use .size instead of .count() to include missing values
# .shape[0] also would work

A technical summary of the data: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


None



A description with the main statistics of the numerical data: 


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292




the first lines of the series 'Age' and its whole length: 


0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
Name: Age, dtype: float64

891

### Exercise 3 - aggregating statistics
---

Read through the tutorial:  
[How to calculate summary statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#)  

Use pandas functions to display the following summary statistics from the titanic dataset:  

1.  The average (mean) age of passengers  
2.  The median age and fare  
3.  The mean fare
4.  The modal fare and gender

**Test output**:   
29.699118, Age 28.0000 Fare 14.4542, 32.2042079685746, Fare 8.05 Sex male 
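
A minimal sketch of mean, median and mode on a made-up four-row frame (the values are illustrative only). It also shows that `mode()` returns a table rather than a single value, because a column can have more than one most-frequent value.

```python
import pandas as pd

# Invented data: 'Fare' has one mode, 'Sex' has a two-way tie.
toy = pd.DataFrame({
    "Fare": [8.05, 8.05, 7.25, 71.28],
    "Sex": ["male", "male", "female", "female"],
})

mean_fare = toy["Fare"].mean()
median_fare = toy["Fare"].median()

# One row per tied mode; shorter columns are padded with NaN.
modes = toy[["Fare", "Sex"]].mode()
print(mean_fare, median_fare)
print(modes)
```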


In [4]:
print('The average (mean) age of passengers is: ', titanic.Age.mean())
print('\n')

print('The median age and fare is: ')
display(titanic[['Age', 'Fare']].median())
print('\n')

print('The mean fare is: ', titanic.Fare.mean())
print('\n')

print('The modal fare and gender is: ')
display(titanic[['Fare', 'Sex']].mode())

The average (mean) age of passengers is:  29.69911764705882


The median age and fare is: 


Age     28.0000
Fare    14.4542
dtype: float64



The mean fare is:  32.2042079685746


The modal fare and gender is: 


Unnamed: 0,Fare,Sex
0,8.05,male


### Exercise 4 - displaying other statistics
---

Take a look at the list of methods available for giving summary statistics [here](https://pandas.pydata.org/docs/user_guide/basics.html#basics-stats) 

Use pandas functions, and your existing knowledge, to display the following summary statistics from the titanic dataset:

1.  The total number of passengers on the titanic
2.  The age of the youngest passenger
3.  The most expensive ticket price
4.  The range of ticket prices
5.  The number of passengers with cabins
6.  The code for the port where the highest number of passengers embarked
7.  The most populous gender
8.  The standard deviation for age and fare

**Tests**:  
891, 0.42, 512.3292, 512.3292, 204, S, male, Age 14.526497 Fare 49.693429
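
Two of these statistics have no single dedicated method: a range is the maximum minus the minimum, and "how many have a value" is a count of non-missing entries. Here is a sketch on an invented three-row frame; the column names mirror the Titanic ones but the values are made up.

```python
import numpy as np
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    "Fare": [0.0, 7.25, 512.33],
    "Cabin": ["C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S"],
})

# Range as a single number: max minus min.
fare_range = toy["Fare"].max() - toy["Fare"].min()

# notna() gives True/False per row; summing counts the True values.
with_cabin = toy["Cabin"].notna().sum()

# mode() returns a Series, so take its first entry for a single value.
top_port = toy["Embarked"].mode()[0]
print(fare_range, with_cabin, top_port)
```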

In [35]:
print('The total number of passengers on the titanic: ', len(titanic.PassengerId))

print('The age of the youngest passenger: ', titanic.Age.min())

print('The most expensive ticket price: ', titanic.Fare.max())

print('The range of ticket prices: ', titanic.Fare.min(), 'to', titanic.Fare.max())

print('The number of passengers with cabins: ', titanic.Cabin.notna().sum())

print('The code for the port where the highest number of passengers embarked: ', titanic.Embarked.mode()[0])

print('The most populous gender: ', titanic.Sex.mode()[0])

#print('The standard deviation for age and fare: ', round(titanic.Age.std(),6), 'and', round(titanic.Fare.std(),3))
print('The standard deviation for age and fare: ')
display(titanic[['Age', 'Fare']].std())

The total number of passengers on the titanic:  891
The age of the youngest passenger:  0.42
The most expensive ticket price:  512.3292
The range of ticket prices:  0.0 to 512.3292
The number of passengers with cabins:  204
The code for the port where the highest number of passengers embarked:  S
The most populous gender:  male
The standard deviation for age and fare: 


Age     14.526497
Fare    49.693429
dtype: float64

### Exercise 5 - aggregating statistics grouped by category
---

Refer again to the tutorial  
[How to calculate summary statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#)   
looking particularly at the section on Aggregating statistics grouped by category.

1.  What is the mean age for male versus female Titanic passengers?
2.  What is the mean ticket fare price for each of the sex and cabin class combinations?
3.  What is the mean ticket fare price for passengers who embarked at each port?
4.  Which passenger class had the highest number of survivors (for now, just show the statistics - it may not be meaningful yet)?

**Test output**:  
1.  female 27.915709 male 30.726645
2.  
```
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
```
3.  
```
C    59.954144
Q    13.276030
S    27.079812
```
4. 

```
Survived  Pclass

0         1          80
          2          97
          3         372
1         1         136
          2          87
          3         119
```
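
The grouped statistics above follow a split, apply, combine pattern. A minimal sketch on a made-up four-row frame (values invented for illustration) shows grouping by one column, by two columns, and counting rows per group.

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Pclass": [1, 1, 3, 3],
    "Fare": [100.0, 60.0, 16.0, 12.0],
})

# One grouping column: one mean per sex.
mean_by_sex = toy.groupby("Sex")["Fare"].mean()

# Two grouping columns: one mean per (sex, class) combination.
mean_by_sex_class = toy.groupby(["Sex", "Pclass"])["Fare"].mean()

# size() counts rows in each group, including rows with missing values.
counts = toy.groupby(["Sex", "Pclass"]).size()
print(mean_by_sex, mean_by_sex_class, counts, sep="\n\n")
```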






In [10]:
print('The mean age for male versus female Titanic passengers: ')
display(titanic.groupby('Sex')['Age'].mean())
print('\n')

print('The mean ticket fare price for each of the sex and cabin class combinations: ')
display(titanic.groupby(['Sex', 'Pclass'])['Fare'].mean())
print('\n')

print('the mean ticket fare price for passengers who embarked at each port: ')
display(titanic.groupby('Embarked')['Fare'].mean())
print('\n')

print('The passenger class that had the highest number of survivors: ')
#display(titanic.groupby('Survived')['Pclass'].value_counts())
display(titanic.groupby(['Survived', 'Pclass']).size())
print('\n')

The mean age for male versus female Titanic passengers: 


Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64



The mean ticket fare price for each of the sex and cabin class combinations: 


Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64



the mean ticket fare price for passengers who embarked at each port: 


Embarked
C    59.954144
Q    13.276030
S    27.079812
Name: Fare, dtype: float64



The passenger class that had the highest number of survivors: 


Survived  Pclass
0         1          80
          2          97
          3         372
1         1         136
          2          87
          3         119
dtype: int64





### Exercise 6 - an aggregation of different statistics
---

Use the function titanic.agg() as shown in the tutorial  
[How to calculate summary statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#)  

1.  Display

```
     {
         "Age": ["min", "max", "median", "skew"],
         "Fare": ["min", "max", "median", "mean"]
     }
```
2.  Display:  
min, max and mean for Age  
min, max and standard deviation for Fare  
count for Cabin

**Test output**:   
1.  	
```
              Age        Fare
max     80.000000  512.329200
mean          NaN   32.204208
median  28.000000   14.454200
min      0.420000    0.000000
skew     0.389108         NaN
```

2.   
```
             Age        Fare  Cabin
count        NaN         NaN  204.0
max    80.000000  512.329200    NaN
mean   29.699118         NaN    NaN
min     0.420000    0.000000    NaN
std          NaN   49.693429    NaN
```
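
The NaN entries in these tables are not missing data: they mark statistics that were not requested for that column. A small sketch on an invented frame (values are illustrative) makes this visible.

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [5.0, 10.0, 30.0]})

# Ask for different statistics per column.
stats = toy.agg({"Age": ["min", "max"], "Fare": ["max", "mean"]})
print(stats)

# 'min' was only requested for Age, so the Fare cell in that row is NaN.
fare_min_is_blank = pd.isna(stats.loc["min", "Fare"])
print(fare_min_is_blank)
```

The row order of the result can differ between pandas versions, so look values up by label (`stats.loc["min", "Age"]`) rather than by position.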




In [11]:
display(titanic.agg({
         "Age": ["min", "max", "median", "skew"],
         "Fare": ["min", "max", "median", "mean"]
     }))

Unnamed: 0,Age,Fare
max,80.0,512.3292
mean,,32.204208
median,28.0,14.4542
min,0.42,0.0
skew,0.389108,


In [12]:
display(titanic.agg({
    'Age': ['min', 'max', 'mean'],
    'Fare': ['min', 'max', 'std'],
    'Cabin': 'count'
      }))

Unnamed: 0,Age,Fare,Cabin
count,,,204.0
max,80.0,512.3292,
mean,29.699118,,
min,0.42,0.0,
std,,49.693429,


### Exercise 7 - count by category
---

Read the section Count number of records by category in the tutorial  
[How to calculate summary statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#)

1. Display the number of passengers of each gender who had a ticket
2. Display the number of passengers who embarked at each port and had a ticket
3. Calculate the percentage of PassengerIds who survived the sinking of the Titanic (*Hint:  try getting the PassengerIds with a count for survived or not.  Store this value in a new variable, which will contain a list/array.  The second item in this list will be the number who survived.  You can use this number and the count of PassengerIds to calculate the percentage*)

**Test output**:  
1.  female 314, male 577
2.  C 168, Q 77, S 644
3.  38.38383838383838
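
The hint for the percentage can be sketched on a made-up five-value Series (not the real survival data). Looking the count up by its label `1` avoids relying on the order in which `value_counts()` happens to list the values.

```python
import pandas as pd

# Illustrative data: 1 = survived, 0 = did not survive.
survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

counts = survived.value_counts()   # indexed by the values 0 and 1
n_survived = counts.loc[1]         # look up the count for label 1
pct = n_survived / survived.count() * 100
print(pct)

# For a 0/1 column the mean is the same proportion.
pct_via_mean = survived.mean() * 100
print(pct_via_mean)
```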



In [13]:
print('the number of passengers of each gender who had a ticket: ')
display(titanic.groupby('Sex')['Ticket'].count())

the number of passengers of each gender who had a ticket: 


Sex
female    314
male      577
Name: Ticket, dtype: int64

In [14]:
print('the number of passengers who embarked at each port and had a ticket: ')
display(titanic.groupby('Embarked')['Ticket'].count())

the number of passengers who embarked at each port and had a ticket: 


Embarked
C    168
Q     77
S    644
Name: Ticket, dtype: int64

In [27]:
print('percentage of PassengerIds who survived the sinking of the Titanic: ')
#display(titanic[titanic.Survived == 1]['PassengerId'].count()/titanic.PassengerId.count())
display(titanic.Survived.value_counts(normalize=True))

percentage of PassengerIds who survived the sinking of the Titanic: 


0    0.616162
1    0.383838
Name: Survived, dtype: float64

### Exercise 8 - summary happiness statistics
---

Open the data set here: https://github.com/futureCodersSE/working-with-data/blob/main/Happiness-Data/2019.xlsx?raw=true

It contains data on people's perception of happiness levels in a number of countries across the world.

1.  Display the number of records in the set  
2.  Display the description of the numerical data  
3.  Display the highest GDP and life expectancy  
4.  Display the mean, max and min for Freedom,  mean, max, min and skew for Generosity and mean, min, max and std for GDP  

**Test output**:  
1.  156
2.  Table showing count, mean, std, min, 25%, 50%, 75%, max for 8 columns
3.  GDP 1.684, life expectancy 1.141  
4.  


```
      Freedom to make life choices  Generosity  GDP per capita
max                       0.631000    0.566000        1.684000
mean                      0.392571    0.184846        0.905147
min                       0.000000    0.000000        0.000000
skew                           NaN    0.745942             NaN
std                            NaN         NaN        0.398389
```




In [30]:
happiness = pd.read_excel('https://github.com/futureCodersSE/working-with-data/blob/main/Happiness-Data/2019.xlsx?raw=true')
happiness.head()

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.34,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.6,1.383,1.573,0.996,0.592,0.252,0.41
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341
3,4,Iceland,7.494,1.38,1.624,1.026,0.591,0.354,0.118
4,5,Netherlands,7.488,1.396,1.522,0.999,0.557,0.322,0.298


In [34]:
print('number of records in the set:', happiness.shape[0])
print('description of the numerical data:')
display(happiness.info())
print('\n')
print('the highest GDP and life expectancy:')
display(happiness[['GDP per capita', 'Healthy life expectancy']].max())
print('mean, max and min for Freedom, mean, max, min and skew for Generosity and mean, min, max and std for GDP')
display(happiness.agg({
    'Freedom to make life choices': ['mean', 'max', 'min'],
    'Generosity': ['mean', 'max', 'min', 'skew'],
    'GDP per capita': ['mean', 'min', 'max', 'std']
    }))

number of records in the set: 156
description of the numerical data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  156 non-null    int64  
 1   Country or region             156 non-null    object 
 2   Score                         156 non-null    float64
 3   GDP per capita                156 non-null    float64
 4   Social support                156 non-null    float64
 5   Healthy life expectancy       156 non-null    float64
 6   Freedom to make life choices  156 non-null    float64
 7   Generosity                    156 non-null    float64
 8   Perceptions of corruption     156 non-null    float64
dtypes: float64(7), int64(1), object(1)
memory usage: 11.1+ KB


None



the highest GDP and life expectancy:


GDP per capita             1.684
Healthy life expectancy    1.141
dtype: float64

mean, max and min for Freedom, mean, max, min and skew for Generosity and mean, min, max and std for GDP


Unnamed: 0,Freedom to make life choices,Generosity,GDP per capita
max,0.631,0.566,1.684
mean,0.392571,0.184846,0.905147
min,0.0,0.0,0.0
skew,,0.745942,
std,,,0.398389


### Exercise 9 - migration data
---

Open the dataset at this url: https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true  Open the sheet named *Country Migration*

1.  Describe the dataset  
2.  Show summary information
3.  Display the mean net per 10K migration in each of the years 2015 to 2019
4.  Display the mean, max and min migration for the year 2019 for each of the regions (*base_country_wb_region*)
5.  Display the median net migration for the years 2015 and 2019 for the base countries by income level
6.  Display the number of target countries in each income level
7.  Display the mean net migration for all five years, for each income level

**Test output**:  
1  count, mean, std, min, 25%, 50%, 75%, max for 9 columns

2  shows 17 columns with non-null count of 4148 in each column

3  
```
net_per_10K_2015    0.461757
net_per_10K_2016    0.150248
net_per_10K_2017   -0.080272
net_per_10K_2018   -0.040591
net_per_10K_2019   -0.022743
dtype: float64
```

4  
```
                            net_per_10K_2019
                                 mean    max     min
base_country_wb_region
East Asia & Pacific          0.198827  21.57   -9.88
Europe & Central Asia        0.208974  87.71  -21.34
Latin America & Caribbean   -0.904602  21.15  -31.75
Middle East & North Africa  -0.107655  55.60  -50.33
North America                0.239246  23.20   -0.29
South Asia                  -0.514577  13.72  -24.89
Sub-Saharan Africa          -0.279729  37.11  -21.54
```

5  
```
                        net_per_10K_2015  net_per_10K_2019
base_country_wb_income
High Income                         0.02              0.04
Low Income                          0.42             -0.05
Lower Middle Income                -0.02             -0.07
Upper Middle Income                -0.03             -0.08
```

6  
```
base_country_wb_income
High Income            2415
Low Income              185
Lower Middle Income     653
Upper Middle Income     895
Name: target_country_name, dtype: int64 
```

7  
```
                        net_per_10K_2015  net_per_10K_2016  net_per_10K_2017  net_per_10K_2018  net_per_10K_2019
base_country_wb_income
High Income                     0.505482          0.391379          0.314178          0.379201          0.401470
Low Income                      1.876432          0.798270         -0.684865         -0.677784         -0.681459
Lower Middle Income             0.591654         -0.029893         -0.519433         -0.527136         -0.476616
Upper Middle Income            -0.043419         -0.502916         -0.699240         -0.686626         -0.700101
```
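
Several of these tasks select more than one column after a `groupby`, or apply more than one statistic to a single column. A minimal sketch on an invented frame follows; the column names `income`, `net_2015` and `net_2019` are hypothetical stand-ins for the real ones.

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    "income": ["High", "High", "Low", "Low"],
    "net_2015": [1.0, 3.0, -2.0, 0.0],
    "net_2019": [0.5, 1.5, -1.0, -3.0],
})

# Pass a list to select several columns after groupby.
year_cols = ["net_2015", "net_2019"]
means = toy.groupby("income")[year_cols].mean()

# Pass a list of function names to agg for several statistics at once.
stats_2019 = toy.groupby("income")["net_2019"].agg(["mean", "max", "min"])
print(means, stats_2019, sep="\n\n")
```

Selecting with a list (double square brackets when written inline) is the form that recent pandas versions expect; a bare comma-separated selection after `groupby` is no longer accepted.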






In [35]:
url = "https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true"
migration = pd.read_excel(url,sheet_name="Country Migration")
# Select several columns after groupby with a list (double brackets)
migration.groupby('base_country_wb_income')[['net_per_10K_2015','net_per_10K_2016','net_per_10K_2017','net_per_10K_2018','net_per_10K_2019']].mean()




Unnamed: 0_level_0,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
base_country_wb_income,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
High Income,0.505482,0.391379,0.314178,0.379201,0.40147
Low Income,1.876432,0.79827,-0.684865,-0.677784,-0.681459
Lower Middle Income,0.591654,-0.029893,-0.519433,-0.527136,-0.476616
Upper Middle Income,-0.043419,-0.502916,-0.69924,-0.686626,-0.700101


In [37]:
count_mig = pd.read_excel('https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true',
                         sheet_name="Country Migration")

count_mig.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4148 entries, 0 to 4147
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   base_country_code         4148 non-null   object 
 1   base_country_name         4148 non-null   object 
 2   base_lat                  4148 non-null   float64
 3   base_long                 4148 non-null   float64
 4   base_country_wb_income    4148 non-null   object 
 5   base_country_wb_region    4148 non-null   object 
 6   target_country_code       4148 non-null   object 
 7   target_country_name       4148 non-null   object 
 8   target_lat                4148 non-null   float64
 9   target_long               4148 non-null   float64
 10  target_country_wb_income  4148 non-null   object 
 11  target_country_wb_region  4148 non-null   object 
 12  net_per_10K_2015          4148 non-null   float64
 13  net_per_10K_2016          4148 non-null   float64
 14  net_per_

In [38]:
count_mig.describe()

Unnamed: 0,base_lat,base_long,target_lat,target_long,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
count,4148.0,4148.0,4148.0,4148.0,4148.0,4148.0,4148.0,4148.0,4148.0
mean,28.418022,21.698305,28.418022,21.698305,0.461757,0.150248,-0.080272,-0.040591,-0.022743
std,25.086012,61.937381,25.086012,61.937381,5.00653,4.201118,3.203092,3.593876,3.633247
min,-40.900557,-106.346771,-40.900557,-106.346771,-37.01,-40.89,-43.66,-56.22,-50.33
25%,14.058324,-3.435973,14.058324,-3.435973,-0.15,-0.19,-0.21,-0.21,-0.21
50%,35.86166,19.145136,35.86166,19.145136,0.0,0.0,0.0,0.0,0.0
75%,47.516231,53.688046,47.516231,53.688046,0.24,0.22,0.16,0.17,0.18
max,64.963051,179.414413,64.963051,179.414413,150.68,124.48,87.0,91.41,87.71


In [39]:
count_mig[['net_per_10K_2015','net_per_10K_2016','net_per_10K_2017','net_per_10K_2018','net_per_10K_2019']].mean()

net_per_10K_2015    0.461757
net_per_10K_2016    0.150248
net_per_10K_2017   -0.080272
net_per_10K_2018   -0.040591
net_per_10K_2019   -0.022743
dtype: float64

In [40]:
count_mig.groupby('base_country_wb_region')['net_per_10K_2019'].agg(['min', 'max', 'mean'])

Unnamed: 0_level_0,min,max,mean
base_country_wb_region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
East Asia & Pacific,-9.88,21.57,0.198827
Europe & Central Asia,-21.34,87.71,0.208974
Latin America & Caribbean,-31.75,21.15,-0.904602
Middle East & North Africa,-50.33,55.6,-0.107655
North America,-0.29,23.2,0.239246
South Asia,-24.89,13.72,-0.514577
Sub-Saharan Africa,-21.54,37.11,-0.279729


In [41]:
count_mig.groupby('base_country_wb_income')[['net_per_10K_2015','net_per_10K_2016','net_per_10K_2017','net_per_10K_2018','net_per_10K_2019']].median()

Unnamed: 0_level_0,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
base_country_wb_income,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
High Income,0.02,0.03,0.04,0.04,0.04
Low Income,0.42,0.06,-0.17,-0.15,-0.05
Lower Middle Income,-0.02,-0.05,-0.07,-0.07,-0.07
Upper Middle Income,-0.03,-0.06,-0.08,-0.08,-0.08


In [42]:
count_mig.groupby('target_country_wb_income')['target_country_name'].count()

target_country_wb_income
High Income            2415
Low Income              185
Lower Middle Income     653
Upper Middle Income     895
Name: target_country_name, dtype: int64

### Exercise 10 - calculating range over a grouped series

Open the sheet named *Skill Migration* in the dataset at this URL: https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true

1.  Display the max for each skill group category of net migration for the year 2017
2.  Assign the max for each skill group category of net migration for the year 2017 to a variable called **max_skill_migration** and print `max_skill_migration`
3.  Create a second variable called **min_skill_migration** and assign to it the min for each skill group category of net migration for the year 2017, print `min_skill_migration`

4.  You now have two series, `max_skill_migration` and `min_skill_migration`, each of which is a pandas Series (a labelled array built on NumPy).  You can perform calculations on these two series in the same way as you would on individual data items.

So, you can calculate the range for each skill category by subtracting the `min_skill_migration` from `max_skill_migration` to get a new series **skill_migration_range**

`skill_migration_range = max_skill_migration - min_skill_migration`

Try it out.

5.  Now calculate the range for the year 2019
6.  Now calculate the range for countries grouped by income level for the year 2015
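
The subtraction in step 4 works because pandas lines the two Series up by their index labels before subtracting. Here is a minimal sketch on an invented frame; the names `category` and `net_2017` are hypothetical.

```python
import pandas as pd

# Illustrative data only.
toy = pd.DataFrame({
    "category": ["Tech", "Tech", "Soft", "Soft"],
    "net_2017": [10.0, -30.0, 5.0, -1.0],
})

grouped = toy.groupby("category")["net_2017"]
max_by_cat = grouped.max()   # Series indexed by category
min_by_cat = grouped.min()   # Series with the same index

# Element-wise subtraction, matched on the category labels.
range_by_cat = max_by_cat - min_by_cat
print(range_by_cat)
```

Changing the grouping column or the year column in the `groupby` line is all that is needed for steps 5 and 6.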

**Test output**:  
1 and 2  
```
skill_group_category
Business Skills                1048.20
Disruptive Tech Skills         1478.56
Soft Skills                    1572.35
Specialized Industry Skills    1906.14
Tech Skills                    1336.78
Name: net_per_10K_2017, dtype: float64
```

3    
```
skill_group_category
Business Skills               -3471.35
Disruptive Tech Skills        -2646.19
Soft Skills                   -2542.23
Specialized Industry Skills   -6604.67
Tech Skills                   -6060.98
Name: net_per_10K_2017, dtype: float64
```

4  
```
skill_group_category
Business Skills                4519.55
Disruptive Tech Skills         4124.75
Soft Skills                    4114.58
Specialized Industry Skills    8510.81
Tech Skills                    7397.76
Name: net_per_10K_2017, dtype: float64
```

5  
```
skill_group_category
Business Skills                4543.96
Disruptive Tech Skills         3651.81
Soft Skills                    5528.47
Specialized Industry Skills    4036.44
Tech Skills                    3424.45
Name: net_per_10K_2019, dtype: float64
```

6  
```
wb_income
High income            4246.50
Low income             4556.42
Lower middle income    2148.36
Upper middle income    4045.43
Name: net_per_10K_2015, dtype: float64
```





In [43]:
skill_mig = pd.read_excel('https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true',
                        sheet_name='Skill Migration')

skill_mig.head()

Unnamed: 0,country_code,country_name,wb_income,wb_region,skill_group_id,skill_group_category,skill_group_name,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
0,af,Afghanistan,Low income,South Asia,2549,Tech Skills,Information Management,-791.59,-705.88,-550.04,-680.92,-1208.79
1,af,Afghanistan,Low income,South Asia,2608,Business Skills,Operational Efficiency,-1610.25,-933.55,-776.06,-532.22,-790.09
2,af,Afghanistan,Low income,South Asia,3806,Specialized Industry Skills,National Security,-1731.45,-769.68,-756.59,-600.44,-767.64
3,af,Afghanistan,Low income,South Asia,50321,Tech Skills,Software Testing,-957.5,-828.54,-964.73,-406.5,-739.51
4,af,Afghanistan,Low income,South Asia,1606,Specialized Industry Skills,Navy,-1510.71,-841.17,-842.32,-581.71,-718.64


In [44]:
skill_mig.groupby('skill_group_category')['net_per_10K_2017'].max()

skill_group_category
Business Skills                1048.20
Disruptive Tech Skills         1478.56
Soft Skills                    1572.35
Specialized Industry Skills    1906.14
Tech Skills                    1336.78
Name: net_per_10K_2017, dtype: float64

In [45]:
max_skill_migration = skill_mig.groupby('skill_group_category')['net_per_10K_2017'].max()

print(max_skill_migration)

skill_group_category
Business Skills                1048.20
Disruptive Tech Skills         1478.56
Soft Skills                    1572.35
Specialized Industry Skills    1906.14
Tech Skills                    1336.78
Name: net_per_10K_2017, dtype: float64


In [46]:
min_skill_migration = skill_mig.groupby('skill_group_category')['net_per_10K_2017'].min()

print(min_skill_migration)

skill_group_category
Business Skills               -3471.35
Disruptive Tech Skills        -2646.19
Soft Skills                   -2542.23
Specialized Industry Skills   -6604.67
Tech Skills                   -6060.98
Name: net_per_10K_2017, dtype: float64


In [47]:
skill_migration_range = max_skill_migration - min_skill_migration

print(skill_migration_range)

skill_group_category
Business Skills                4519.55
Disruptive Tech Skills         4124.75
Soft Skills                    4114.58
Specialized Industry Skills    8510.81
Tech Skills                    7397.76
Name: net_per_10K_2017, dtype: float64


In [48]:
skill_migration_range2019 = skill_mig.groupby('skill_group_category')['net_per_10K_2019'].max() \
                          - skill_mig.groupby('skill_group_category')['net_per_10K_2019'].min()

print(skill_migration_range2019)

skill_group_category
Business Skills                4543.96
Disruptive Tech Skills         3651.81
Soft Skills                    5528.47
Specialized Industry Skills    4036.44
Tech Skills                    3424.45
Name: net_per_10K_2019, dtype: float64


In [49]:
# Task 6 asks for the 2015 range grouped by income level, so group by 'wb_income'
income_migration_range2015 = skill_mig.groupby('wb_income')['net_per_10K_2015'].max() \
                           - skill_mig.groupby('wb_income')['net_per_10K_2015'].min()

print(income_migration_range2015)

wb_income
High income            4246.50
Low income             4556.42
Lower middle income    2148.36
Upper middle income    4045.43
Name: net_per_10K_2015, dtype: float64
