# Intro

This course covers the essential Python skills you'll need so you can start using Python for data science. The course is ideal for someone with some previous coding experience who wants to level up their Python skills.

If you are a first-time coder, you may want to check out these ["Python for non-programmers"](https://wiki.python.org/moin/BeginnersGuide/NonProgrammers) learning resources.


## Learning Objectives

You will know how to:

1. Calculate measures of location

2. Calculate measures of dispersion

3. Load and summarise a dataset by its location and dispersion


---
<h1>Introduction</h1>
Using Python, we can calculate measures of location and dispersion in a number of ways.
First, we can calculate them manually, as we did in the videos. For example, the mean of the temperatures 6, 10, 9, 9, 12, 6, 2 is:

In [None]:
(6 + 10 + 9 + 9 + 12 + 6 + 2) / 7

7.714285714285714

Using a programming language like Python, we can also calculate the sum $\sum_i x_i$ using a loop (if this seems unclear, do not worry, you don't need to know how to do this yet!):

In [None]:
sum_x = 0 # this will store the sum of all the temperatures

for x_i in [6, 10, 9, 9, 12, 6, 2]: # x_i will be set to each value in the list of temperatures

    sum_x = sum_x + x_i 
    print(f'x_i is equal to {x_i} and now the sum so far equals {sum_x}')

print(f'To calculate the mean we divide the sum by the number of temperatures {sum_x}/7 = {sum_x/7}')

x_i is equal to 6 and now the sum so far equals 6
x_i is equal to 10 and now the sum so far equals 16
x_i is equal to 9 and now the sum so far equals 25
x_i is equal to 9 and now the sum so far equals 34
x_i is equal to 12 and now the sum so far equals 46
x_i is equal to 6 and now the sum so far equals 52
x_i is equal to 2 and now the sum so far equals 54
To calculate the mean we divide the sum by the number of temperatures 54/7 = 7.714285714285714


However, for common calculations like the mean, we do not need to write the code ourselves. Instead, we can use ready-made functions from libraries such as NumPy, which we will look at in this homework.
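As a minimal sketch of what this looks like, the mean of the same seven temperatures can be computed with NumPy's `np.mean`. The variable names are only illustrative.

```python
import numpy as np

# The same seven temperatures used in the manual calculation
temperatures = np.array([6, 10, 9, 9, 12, 6, 2])

# np.mean adds up the values and divides by how many there are
mean_temperature = np.mean(temperatures)
print(mean_temperature)
```

The printed value should agree with the result of the manual calculation and of the loop.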

---
## Task 1: Getting our Wampis Global Dataset Into Python


In this homework, you will load an external dataset, `GlobalWampis.csv`. See the **Annex** to learn about the Wampis Global Dataset and how to load it into Python.




📚 QUESTION **A**
### **A.** Load the Wampis Global dataset into a pandas DataFrame and check the **columns** and first **5 records**.

⚡ MY ANSWER **A**

In [None]:
# Insert here your code

---
## Task 2: Understanding the Wampis Global Dataset

In this task, we need to calculate some statistics from the Wampis Global Dataset. 
Often when faced with data, a first step is to compute summary statistics. 

To do so, we use [NumPy](https://numpy.org/doc/stable/index.html), which has fast built-in aggregation functions for working on arrays. Read the following [notebook](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.04-Computation-on-arrays-aggregates.ipynb#scrollTo=VEcNBuCvCYM3); it will help you with the computations in Python.
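To give a feel for these aggregation functions, here is a small sketch that applies them to the seven temperatures from the introduction rather than to the Wampis data. The variable names are illustrative, and this is one possible way to write it, not the only one.

```python
import numpy as np

# Toy data: the seven temperatures from the introduction
temperatures = np.array([6, 10, 9, 9, 12, 6, 2])

minimum = np.min(temperatures)
maximum = np.max(temperatures)
q25 = np.percentile(temperatures, 25)    # 25th percentile
median = np.median(temperatures)         # same as the 50th percentile
q75 = np.percentile(temperatures, 75)    # 75th percentile

# np.var and np.std divide by n by default (population formulas, ddof=0)
variance = np.var(temperatures)
std_dev = np.std(temperatures)

# Passing ddof=1 divides by n - 1 instead (sample variance)
sample_variance = np.var(temperatures, ddof=1)

print("Min / max:          ", minimum, maximum)
print("25th / 50th / 75th: ", q25, median, q75)
print("Variance (ddof=0):  ", variance)
print("Variance (ddof=1):  ", sample_variance)
print("Standard deviation: ", std_dev)
```

Note that NumPy's `np.var` uses `ddof=0` by default, while the pandas method `Series.var()` uses `ddof=1`, so the two can give slightly different numbers for the same column.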

📚 QUESTION **B**
### **B.** Select a ratio variable (e.g. `height`) and calculate the following statistics:

* Minimum
* Maximum
* 25th percentile
* Median
* 75th percentile
* Mode
* Mean (average)
* Variance
* Standard deviation


⚡ MY ANSWER **B**

In [None]:
from scipy import stats # see annex (Finding the mode with python) for further information

print("Minimum height:    ", #insert your code here)
print("Maximum height:    ", #insert your code here)
print("25th percentile:   ", #insert your code here)
print("Median:            ", #insert your code here)
print("75th percentile:   ",  #insert your code here)
print("Mode height:       ", stats.mode(df['height'])) # see annex
print("Mean height:       ",  #insert your code here)
print("Variance:          ",  #insert your code here)
print("Standard deviation:",  #insert your code here)

## Task 3: Visualising the Wampis Global Dataset





Of course, it's more useful to see a visual representation of this data, which we can accomplish using tools in `plotly.express`.

📚 QUESTION **C**

### **C.** Create a histogram to represent the variable `height`.

⚡ MY ANSWER **C**

In [None]:
import plotly.express as px
fig = px.histogram(#insert your code here)
fig.show()

## Task 4: The importance of outliers



Sometimes our data can contain outliers, which are data points that are very different from the rest of the data. For example, if someone mistakenly entered their `height` as 1800 (instead of 180), this error would be an outlier. Run the following code cell:

In [None]:
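# add a row with height 1800 to a copy of the df dataset
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat; pd and df come from the Annex cell)
outlier_row = pd.DataFrame([{'age': 22, 'height': 1800}])
df_outlier = pd.concat([df, outlier_row], ignore_index=True)
# print the last two rows of the df_outlier dataset
df_outlier.tail(2)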

Unnamed: 0,age,gender,height,region,n.languages,happy,financial,politics,global
699,36.0,Female,177.0,e5543,2.0,Strongly Agree,Agree,5.0,no
700,22.0,,1800.0,,,,,,
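To build some intuition for what a single outlier can do, here is a small sketch on a handful of made-up heights (not the Wampis data). It compares the mean and the median before and after one mistyped value is added.

```python
import numpy as np

# Made-up heights in cm, for illustration only (not taken from the dataset)
heights = np.array([157, 165, 170, 177, 186])

# The same heights plus one mistyped value
heights_with_outlier = np.append(heights, 1800)

mean_before = np.mean(heights)
mean_after = np.mean(heights_with_outlier)
median_before = np.median(heights)
median_after = np.median(heights_with_outlier)

print("Mean:  ", mean_before, "->", mean_after)
print("Median:", median_before, "->", median_after)
```

In this toy example the mean jumps from 171 to 442.5, while the median only moves from 170 to 173.5.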


📚 QUESTION **D**

### **D.** Calculate the measures of location on the dataset with one outlier (i.e. `df_outlier`). How do these values compare to the measures of location for the data without outliers? Which measures of location are robust to outliers?

In [None]:
from scipy import stats
print("Minimum height:    ", #insert your code here, #insert your code here)
print("Maximum height:    ", #insert your code here, #insert your code here)
print("25th percentile:   ", #insert your code here, #insert your code here)
print("Median:            ", #insert your code here, #insert your code here)
print("75th percentile:   ", #insert your code here, #insert your code here)
print("Mode height:       ", #insert your code here, #insert your code here)
print("Mean height:       ", #insert your code here, #insert your code here)
print("Variance:          ", #insert your code here, #insert your code here)
print("Standard deviation:", #insert your code here, #insert your code here)

---
## Annex: More information

### Research Project specifications


![](https://www.caaap.org.pe/wp-content/uploads/2017/11/AsambleaGTANW_Ago2017.jpg)

| Category            | Information |
| ---------------------- | ------------ |
| Research objective(s):        | Globalization and its impact on indigenous culture         |
| Research question(s): | Does globalization help or hurt the Wampis indigenous people?          |
| Data collection methods:       | Individual survey          |
| Region:      | South America          |
|Country: | Peru |
| Coverage:   | Several villages of Wampis Nation          |

### Wampis Dataset description
| File | Name | Observations (rows) | Variables (cols) |
| --- | --- | --- | --- |
| [GlobalWampis.csv](https://maastrichtuniversity-ids-open.s3.eu-central-1.amazonaws.com/global-studies/GlobalWampis.csv) | Wampis Global Dataset | 700 | 9 |



### Variable descriptions


| Name       | Question                                |
| ---------- | --------------------------------------- |
| `age`      | How old are you?          |
| `gender`   | What is your gender identity?                 |
| `height`    | What is your height (in cm)? |
| `region`    | Region of the Wampis Nation|
| `n.languages`      | How many languages do you speak at home with your family?                    |
| `happy`    | Indicate how strongly you agree/disagree with the following sentence: [Q1. I am happy with my life]          |
| `financial`    | Indicate how strongly you agree/disagree with the following sentence: [Q2. I am satisfied with the financial situation of my household]          |
| `politics`    | How interested would you say you are in politics? |
| `global`    | Do you think the globalization process helps your community? |

### Getting the Global Wampis dataset into Python
The first few lines of code we write will get our dataset into Python and our Google Colab notebook so that we can start working with it.

In [None]:
import pandas as pd
url = 'https://maastrichtuniversity-ids-open.s3.eu-central-1.amazonaws.com/global-studies/GlobalWampis.csv'
df = pd.read_csv(url)
# print first 2 observations
df.head(2)

Unnamed: 0,age,gender,height,region,n.languages,happy,financial,politics,global
0,34.0,Male,186,e4435,1.0,Strongly Agree,Strongly Agree,2.0,yes
1,34.0,Male,157,e5442,2.0,Strongly Agree,Strongly Agree,3.0,yes
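Once a dataset is loaded, it is worth checking its column names, size and data types. The sketch below does this on a tiny stand-in built from the two observations printed above, typed into a string so that it runs without an internet connection. The exact header layout of the real file is an assumption here, and `sample_df` is a hypothetical name.

```python
import io

import pandas as pd

# Stand-in for the real file: the two observations shown above, typed in by hand.
# The header line is an assumption about how the real CSV is laid out.
csv_text = """age,gender,height,region,n.languages,happy,financial,politics,global
34.0,Male,186,e4435,1.0,Strongly Agree,Strongly Agree,2.0,yes
34.0,Male,157,e5442,2.0,Strongly Agree,Strongly Agree,3.0,yes
"""

# pd.read_csv accepts a file-like object as well as a URL or file path
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.columns.tolist())  # list of column names
print(sample_df.shape)             # (number of rows, number of columns)
print(sample_df.dtypes)            # data type of each column
```

The same three checks can be run on `df` after loading the full dataset from the URL.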


### Finding the Mode with Python

Unfortunately, NumPy does not provide a function for calculating the mode, which is why the code for the mode is already filled in for you in Task 2. Instead, we import the `stats` module from SciPy with `from scipy import stats`. See [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html) for further information.


In [None]:
# example: find the mode in age variable
from scipy import stats
array = df['age'] # select the age column from the dataset
print(array)
print(stats.mode(array)) # calculate the mode

0      34.0
1      34.0
2      27.0
3      35.0
4      34.0
       ... 
695    26.0
696    38.0
697    34.0
698    23.0
699    36.0
Name: age, Length: 700, dtype: float64
ModeResult(mode=array([29.]), count=array([36]))
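To see what the mode calculation is doing, here is a sketch that finds the most frequent value by counting with `np.unique`. The ages are made up for illustration, and this is an alternative way to reach the same idea, not how `stats.mode` is implemented.

```python
import numpy as np

# Made-up ages, for illustration only
ages = np.array([34, 34, 27, 35, 34, 26, 38])

# np.unique returns the sorted distinct values and how often each one occurs
values, counts = np.unique(ages, return_counts=True)

# The mode is the value with the highest count
# (np.argmax returns the first maximum, so ties go to the smallest value)
most_common = np.argmax(counts)
mode_value = values[most_common]
mode_count = counts[most_common]

print(f"mode = {mode_value}, appears {mode_count} times")
```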
