<a href="https://colab.research.google.com/github/futureCodersSE/python-programming-for-data/blob/main/Worksheets/2_Describing_and_Interrogating_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Describing and Interrogating Data 
---

When using pandas we first need to import it 

` import pandas as pd ` 


When first looking at a dataset, it is important to be able to see information about the data, such as summary statistics, and to interrogate it to find information. 

Describing data in this way carries the least risk of bias and inaccurate conclusions, because it focuses only on the data that is presented to us.

### Summary Statistics
--- 

Mean - the average  
Median - middle value / 50% of the data (another type of average)  
Mode - the value that appears the most frequently  
Range - the total range of values (max - min)    

**Functions:**

`mean()` - mean (average)   
`mode()` - mode    
`std()` - standard deviation   
`min()` - minimum value of column     
`max()` - maximum value of column  
`median()` - middle value (median)
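
As a rough illustration of the functions listed above, the sketch below applies them to a small Series. The values in `ages` are made up for this example and are not taken from any dataset used in this worksheet.

```python
import pandas as pd

# Hypothetical ages, used only to illustrate the functions above
ages = pd.Series([22, 35, 35, 41, 58])

mean_age = ages.mean()                # sum of the values divided by how many there are
median_age = ages.median()            # middle value once the data is sorted
mode_age = ages.mode()[0]             # mode() returns a Series, so take its first entry
age_range = ages.max() - ages.min()   # range = max - min
std_age = ages.std()                  # sample standard deviation (pandas default)

print(mean_age, median_age, mode_age, age_range, std_age)
```

Note that `mode()` returns a Series rather than a single value, because more than one value can be tied for most frequent.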

### Useful Functions
---
`head()` will show the first 5 rows of the dataframe. You can show a different number of rows by putting the number of rows you would like to see in the brackets.    
`tail()` is the same as `head()` but for the last 5 rows   
`info()` will show information about the overall dataset, including how many non-null values exist in each column, the data type of each column and the dataframe's length   
`describe()` will show summary statistics for all numeric columns   
`iloc[index]`  will show you row / rows at index position or in index range  
`unique()` will show all the unique values in a column   
`nunique()` will show the number of unique values in a column   
`len()` will show the length (can be used on a list, array, column etc)  
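
The functions above can be tried out on a small dataframe. This is only a sketch: the dataframe `df` and its columns (`name`, `city`, `age`) are invented for the example.

```python
import pandas as pd

# Hypothetical data, invented for this example
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat", "Dan", "Eve", "Fay", "Gus"],
    "city": ["Leeds", "York", "Leeds", "Hull", "York", "Leeds", "Hull"],
    "age": [31, 45, 27, 52, 38, 29, 60],
})

print(df.head(3))             # first 3 rows
print(df.tail(2))             # last 2 rows
df.info()                     # prints dtypes and non-null counts itself
print(df.describe())          # summary statistics for the numeric column (age)
print(df.iloc[2])             # the single row at position 2
print(df.iloc[2:5])           # rows at positions 2, 3 and 4 (the end is excluded)
print(df["city"].unique())    # the distinct values in the column
print(df["city"].nunique())   # how many distinct values there are
print(len(df))                # number of rows
```

`info()` prints its report directly and returns `None`, which is why wrapping it in `print()` shows an extra `None` line.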

### Interrogating dataframes 
---

To view subsets:

* single column: `dataframe['column'] `
* multiple columns: `dataframe[['column1', 'column2']]`
* rows by criteria: `dataframe[dataframe['column'] == 'criteria']`
* rows by multiple conditions (wrap each condition in brackets and join them with `&`):   
`dataframe[(dataframe['column1'] == condition1) & (dataframe['column2'] == condition2)]`
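
The subset patterns above might look like this in practice. The dataframe and its column names are assumptions made up for the sketch.

```python
import pandas as pd

# Hypothetical data, invented for this example
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat", "Dan", "Fay"],
    "city": ["Leeds", "York", "Leeds", "Hull", "Leeds"],
    "age": [31, 45, 27, 52, 29],
})

ages = df["age"]                      # single column: gives a Series
names_ages = df[["name", "age"]]      # list of columns: gives a DataFrame
leeds = df[df["city"] == "Leeds"]     # rows where the condition is True

# Both conditions must hold; each one needs its own brackets
young_leeds = df[(df["city"] == "Leeds") & (df["age"] < 30)]

print(young_leeds)
```

The brackets around each condition matter because `&` binds more tightly than `==` and `<` in Python.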


## Data Retrieval 
---

In order to load in a dataset you will need to retrieve it. The following code retrieves different types of data. 

From a webpage (this returns a list of dataframes, one for each table found on the page):

` pd.read_html("url")`

From a CSV file hosted on GitHub:

`pd.read_csv("url")`

From an Excel file hosted on GitHub:

`pd.read_excel("url", sheet_name = "sheet name")`
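
To show what `read_csv` produces without depending on a network connection, the sketch below reads CSV text from memory instead of from a URL. The `csv_text` content is made up; with a real dataset you would pass the URL string in place of the `io.StringIO(...)` object.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file (hypothetical data).
# The last row has no age, so pandas stores it as a missing value.
csv_text = """name,city,age
Ann,Leeds,31
Bob,York,45
Cat,Leeds,
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (rows, columns)
print(df.dtypes)   # data type chosen for each column
```

Reading `.xlsx` files with `read_excel` usually also relies on the `openpyxl` package being installed, which is already the case in many hosted notebook environments.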




### Exercise 1 - open the Titanic dataset and see descriptive info
---

The Titanic dataset is stored at this URL:
https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv

1. Read the dataset into a pandas dataframe that you will call **titanic**.


2. Write a function called **get_summary** that will: 
* Display the first 5 rows of the dataset 
* Use info() to display a technical summary of the data
* Use describe() to display a numerical summary of the data 


**Expected Output** 

```
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
       PassengerId    Survived      Pclass  ...       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  ...  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642  ...    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071  ...    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000  ...    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000  ...    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000  ...    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000  ...    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000  ...    8.000000    6.000000  512.329200

[8 rows x 7 columns]
```

In [None]:


titanic = # read your dataset into this variable (don't forget to import pandas first)



def get_summary():
  # add code below which prints the first 5 rows of the dataset, the info and the numerical summary





# run and visually test using example above
get_summary()



### Exercise 2 - displaying other statistics
---

Take a look at the list of methods available for giving summary statistics [here](https://pandas.pydata.org/docs/user_guide/basics.html#basics-stats).  (You will need to use `.mode()` in this exercise) 

Use pandas functions, and your existing knowledge, to write a function called **get_statistics()** which returns the following summary statistics from the Titanic dataset:

1.  The total number of passengers on the Titanic
2.  The age of the youngest passenger
3.  The most expensive ticket price
4.  The range of ticket prices
5.  The number of passengers with cabins
6.  The code for the port where the highest number of passengers embarked
7.  The most populous gender
8.  The standard deviation for age

In [None]:
def get_statistics():
  # add code below to return the stats listed above 




  return passengers, youngest, most_expensive, range_ticket, no_cabins, embarked[0], gender[0], sd 


# This will run and test your function to see if answers are correct 
actual = get_statistics()
expected = (891, 0.42, 512.3292, 512.3292, 204, 'S', 'male', 14.526497332334044)

if actual == expected:
  print("Test passed", actual)
else: 
  print("Test failed, expected", expected, "but got", actual)


### Exercise 3 - aggregating statistics grouped by category
---

Refer again to the tutorial  
[How to calculate summary statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#)   
looking particularly at the section on Aggregating statistics grouped by category.
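
As a reminder of the general pattern that section describes, here is a minimal sketch of `groupby` on made-up data. The dataframe and its columns (`team`, `level`, `score`, `passed`) are hypothetical and chosen only to show the shape of the calls.

```python
import pandas as pd

# Hypothetical data, invented for this example
df = pd.DataFrame({
    "team": ["red", "red", "blue", "blue", "blue"],
    "level": [1, 2, 1, 1, 2],
    "score": [10.0, 20.0, 30.0, 50.0, 60.0],
    "passed": [1, 0, 1, 1, 0],
})

# Mean of one column for each category
mean_by_team = df.groupby("team")["score"].mean()

# Mean for each combination of two categories
mean_by_team_level = df.groupby(["team", "level"])["score"].mean()

# Summing a 0/1 column counts the rows marked 1 in each category
passes_by_team = df.groupby("team")["passed"].sum()

print(mean_by_team)
print(mean_by_team_level)
print(passes_by_team)
```

Selecting the column before calling the statistic keeps the result limited to that column.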

Write a function called **get_grouped** which displays:

1.  The mean age for male versus female Titanic passengers
2.  The mean ticket fare for each combination of sex and cabin class
3.  The mean ticket fare for passengers who embarked at each port
4.  The number of survivors in each passenger class (for now, just show the statistics; you can identify the class with the most survivors visually) 

**Expected output**
```
              Age
Sex              
female  27.915709
male    30.726645

Pclass  Sex   
1       female    106.125798
        male       67.226127
2       female     21.970121
        male       19.741782
3       female     16.118810
        male       12.661633
Name: Fare, dtype: float64

Embarked
C    59.954144
Q    13.276030
S    27.079812
Name: Fare, dtype: float64

        Survived
Pclass
1            136
2             87
3            119
 ```

In [None]:
def get_grouped():
  # add code below to return the above stats 






# run and test visually using the above expected output 
get_grouped()



### Exercise 4 - using iloc
---
Write a function called **get_middle** to:
*   display the middle 20 records (use the shape of the dataframe to help you identify the index positions of these)
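
Finding the midpoint of a dataframe with an odd number of rows involves a rounding choice. The sketch below uses a hypothetical 7-row dataframe (not the Titanic data) to show how `shape`, floor division, `round()` and `iloc` fit together.

```python
import pandas as pd

# Hypothetical 7-row dataframe, invented for this example
df = pd.DataFrame({"value": range(7)})

n_rows = df.shape[0]             # shape is (number of rows, number of columns)
mid_floor = n_rows // 2          # floor division always rounds down
mid_round = round(n_rows / 2)    # round() sends exact halves to the nearest even integer

# Take 3 rows centred on mid_floor; the end position is excluded
window = df.iloc[mid_floor - 1 : mid_floor + 2]

print(n_rows, mid_floor, mid_round)
print(window)
```

The two midpoints differ by one here, so the choice between them shifts where the selected rows start.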


In [None]:
def get_middle():
  # add code below to return middle 20 records 




# run and test whether your returned 20 records start at the correct index 

actual = get_middle().index[0]
expected = 436

if actual == expected:
  print("Test passed", actual)
else: 
  print("Test failed expected index of", expected, "got", actual)

# If you are failing the test by 1 out, why might this be? Think about what happens when you use floor division vs the round() function 



### Exercise 5 - migration to and from
---

The Excel file at this link (which you have already opened above): https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true has three data sheets: "Country Migration", "Industry Migration" and "Skill Migration".

Read the data sheet "Country Migration" into a variable called **country_df** 

Write a function called **get_uk_mig** that will return all the rows which had migration to the United Kingdom.  

_Hint:_ You should use this syntax for this:

```
uk_df = country_df[..condition...]
```
This will create a new dataframe with only the rows that meet the condition.  In this case the condition might be:
```
country_df['target_country_name'] == "United Kingdom"
```
or
```
country_df['target_country_code'] == "gb"
```
Both will give the same list.

In [None]:
def get_uk_mig():
  # add code below to return all rows which had migration to the UK



# run and test if your returned dataframe is the correct length 

actual = len(get_uk_mig())
expected = 122

if actual == expected:
  print("Test passed", actual)
else: 
  print("Test failed expected", expected, "got", actual)



### Exercise 6 - how many countries are migrated from
---

Using the "Country Migration" sheet again, find the total number of unique countries that people have migrated from.


In [None]:
def migration():
  # add code below to return the total number of unique countries that people have migrated from 




# run and test if you have the correct number of unique countries 
actual = migration()
expected = 140

if actual == expected:
  print("Test passed", actual)
else: 
  print("Test failed expected", expected, "got", actual)


# Reflection
----

## What skills have you demonstrated in completing this notebook?

Your answer:

## What caused you the most difficulty?

Your answer: