# 2A: Practice

In this exercise, you will practice selecting and transforming data using the Numpy and Pandas Python libraries. You can search on the web for documentation and techniques to help you solve the problem. See especially the official [Numpy documentation](https://numpy.org/doc/stable/) and [Pandas documentation](https://pandas.pydata.org/docs/user_guide/index.html).

For each sub-question, save the result in the corresponding variable. For instance, the results for Question 1 should be saved in variables named `q1_1`, `q1_2`, ..., `q1_10`.

## Part 1: Numpy
In this part you will practice with the functions and conventions of the Numpy library using Numpy arrays. You may find it helpful to start by looking at the official Numpy Quickstart [tutorial](https://numpy.org/doc/stable/user/quickstart.html).

In [1]:
# Run this code cell to import the Numpy library
import numpy as np

### Question 1
This question exercises basic indexing, slicing, and aggregation functions in Numpy. The code below generates a 10 by 10 Numpy array of random integers from 0 to 99 (the upper bound of 100 passed to `randint` is exclusive).

In [2]:
# Run but do not modify this code
np.random.seed(0) # so you get the same results each time
vals = np.random.randint(0, 100, (10,10)) # integers from 0 to 99 (100 is exclusive), shape 10 by 10
print(vals)

[[44 47 64 67 67  9 83 21 36 87]
 [70 88 88 12 58 65 39 87 46 88]
 [81 37 25 77 72  9 20 80 69 79]
 [47 64 82 99 88 49 29 19 19 14]
 [39 32 65  9 57 32 31 74 23 35]
 [75 55 28 34  0  0 36 53  5 38]
 [17 79  4 42 58 31  1 65 41 57]
 [35 11 46 82 91  0 14 99 53 12]
 [42 84 75 68  6 68 47  3 76 52]
 [78 15 20 99 58 23 79 13 85 48]]
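As a quick sanity check on the generating call, the sketch below confirms the shape and value range of such an array. The name `sample` is an illustrative stand-in for `vals`; the key point is that the `high` argument of `np.random.randint` is exclusive.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the draw is reproducible
# low=0 is inclusive, high=100 is exclusive, so values fall in 0..99
sample = np.random.randint(0, 100, (10, 10))

print(sample.shape)                           # (10, 10)
print(sample.min() >= 0, sample.max() <= 99)  # True True
```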


Without writing any loops, write code to answer the following questions where question 1's answer is in the variable `q1_1`, question 2's answer is in `q1_2`, and so on.

1. The value at row index 2 and column index 1.
2. The last value in the first row.
3. The largest value overall.  
4. All of the values in the first column.
5. The average of the values in the first row.
6. The minimum of the values in the last column.
7. The average of the values in the first five rows and the first five columns.
8. The five smallest values in the first row, in order from least to greatest.
9. The minimum value of every row.
10. The index of the column with the largest average value.

In [246]:
# Write your code to answer Question 1 here
q1_1 = vals[2, 1]
q1_2 = vals[0, -1]
q1_3 = np.max(vals)
q1_4 = vals[:, 0]
q1_5 = np.mean(vals[0])
q1_6 = np.min(vals[:, -1])
q1_7 = np.mean(vals[:5, :5])
q1_8 = np.sort(vals[0])[:5]
q1_9 = np.min(vals, axis=1)
q1_10 = np.argmax(np.mean(vals, axis=0))
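Several of the Question 1 items hinge on the `axis` argument of Numpy's aggregation functions. The following is a minimal sketch on a small hand-made array (`small` is a made-up example, not part of the assignment data), so the results can be checked by eye: `axis=0` aggregates down the rows and yields one value per column, while `axis=1` yields one value per row.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

col_means = small.mean(axis=0)    # one mean per column
row_mins = small.min(axis=1)      # one minimum per row
best_col = np.argmax(col_means)   # index of the column with the largest mean
smallest_two = np.sort(small[0])[:2]  # two smallest values of the first row, ascending

print(col_means)     # [2.5 3.5 4.5]
print(row_mins)      # [1 4]
print(best_col)      # 2
print(smallest_two)  # [1 2]
```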

### Question 2
This question provides some exercise for fast element-wise computation with Numpy ufuncs. The code below generates a random 2d numpy array `points` that contains 5 points, each of which has an x-value and a y-value. Each row corresponds to a point, the first column (column index 0) contains the x-value, and the second column (column index 1) contains the y-value.

In [35]:
# Run but do not modify this code
np.random.seed(0) # so you get the same results each time
points = np.random.randn(5, 2) # draws values from standard normal distribution
print(points)

[[ 1.76405235  0.40015721]
 [ 0.97873798  2.2408932 ]
 [ 1.86755799 -0.97727788]
 [ 0.95008842 -0.15135721]
 [-0.10321885  0.4105985 ]]


Without writing any loops, write code to answer the following questions where question 1's answer is in the variable `q2_1`, question 2's is in `q2_2`, and so on.

1. Two times every value in `points`.
2. The square of every x-value.
3. The 0-centered y-values: Subtract the average over all y-values from every y-value.
4. The average of the *positive* values in `points`.
5. The points with x-values greater than their y-values.
6. The magnitude of the first point. The magnitude of a point $(x, y)$ is its Euclidean distance from the origin, equal to $\sqrt{x^2 + y^2}$.
7. The Euclidean distance between the first and the second point. The Euclidean distance formula between two points $(x_1, y_1)$ and $(x_2, y_2)$ is $\sqrt{(x_2-x_1)^2 + (y_2-y_1)^2}$.
8. The Euclidean distance between the first point and all points (you may include the distance of 0 from the first point to itself).
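Items 4 and 5 are typically handled with boolean masks. The sketch below uses a small made-up array `pts` (an assumption for illustration, not the assignment's `points`) to show the two kinds of mask: an element-wise mask that selects individual values, and a row mask that selects whole points.

```python
import numpy as np

pts = np.array([[ 2.0, -1.0],
                [-3.0,  4.0],
                [ 5.0,  1.0]])

# Element-wise mask: selects the values 2, 4, 5, 1 as a flat array
positive_mean = pts[pts > 0].mean()

# Row mask: one boolean per point, keeps rows where x > y
x_gt_y = pts[pts[:, 0] > pts[:, 1]]

print(positive_mean)      # 3.0
print(x_gt_y.tolist())    # [[2.0, -1.0], [5.0, 1.0]]
```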

In [270]:
# Write your code to answer Question 2 here
q2_1 = points * 2
q2_2 = np.power(points[:,0],2)
q2_3 = points[:,1] - np.mean(points[:,1])
q2_4 = np.mean(points[points>0])
q2_5 = points[points[:,0]>points[:,1]]
q2_6 = np.sqrt(np.power(points[0,0],2) + np.power(points[0,1],2))
q2_7 = np.sqrt(np.power(points[1,0]-points[0,0], 2) + np.power(points[1,1]-points[0,1], 2))
q2_8 = np.linalg.norm(points - points[0], axis=1)
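The distance from one point to every point (item 8) can be computed without a loop through broadcasting. This is a minimal sketch on made-up points chosen so the distances are whole numbers; `pts` is an illustrative name, not the assignment's array.

```python
import numpy as np

pts = np.array([[0.0, 0.0],
                [3.0, 4.0],
                [6.0, 8.0]])

# Broadcasting: the shape (2,) row pts[0] is subtracted from each of the 3 rows
diffs = pts - pts[0]

# Square, sum across the x/y axis, then take the square root
dists = np.sqrt((diffs ** 2).sum(axis=1))

# np.linalg.norm with axis=1 computes the same per-row Euclidean norm
dists_norm = np.linalg.norm(diffs, axis=1)

print(dists.tolist())  # [0.0, 5.0, 10.0]
```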

## Part 2: Pandas
In this part you will practice manipulating Pandas Series and Dataframes. You may find it helpful to start by looking at the *10 minutes to pandas* [tutorial](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html).

In [92]:
# Run this code cell to import the Pandas library
import pandas as pd

### Question 3
The dataset `temp` imported below as a Pandas series shows temperature deviations (in Celsius) relative to the 1951-1980 average, courtesy of [NASA](https://climate.nasa.gov/vital-signs/global-temperature/).

In [93]:
# Run but do not modify this code
temp = pd.read_csv("climate.csv", index_col=0).squeeze("columns") # squeeze the single data column into a Series
print(temp.head())
temp.plot()

year
1880   -0.16
1881   -0.08
1882   -0.10
1883   -0.16
1884   -0.28
Name: temp_dev, dtype: float64
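A note on loading a one-column CSV as a Series: recent pandas releases (2.0 and later) no longer accept a `squeeze=True` argument in `read_csv`, and the documented replacement is to call `.squeeze("columns")` on the resulting DataFrame. The sketch below demonstrates this on an in-memory CSV; `csv_text` is a made-up stand-in for the contents of `climate.csv`, using the first rows shown above.

```python
import io
import pandas as pd

# Stand-in for the file contents (assumed layout: a year column and one data column)
csv_text = "year,temp_dev\n1880,-0.16\n1881,-0.08\n1882,-0.10\n"

frame = pd.read_csv(io.StringIO(csv_text), index_col=0)  # one-column DataFrame
series = frame.squeeze("columns")                        # collapse it to a Series

print(type(series).__name__)  # Series
print(series.name)            # temp_dev
```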


[Figure: line plot of `temp_dev` (temperature deviation) by `year`, produced by `temp.plot()`]

Without writing any loops, write code to answer the following questions where answers are in the variables `q3_1`, `q3_2`, and so on.

1. The temperature deviation in 1900.
2. Temperature deviations for the last three years (most recent) for which there is data.
3. Temperature deviations for the last three years (most recent) for which there is data converted to Fahrenheit (temperature deviation in Fahrenheit is just 1.8 times the Celsius).
4. The average temperature deviation since 1990 (including 1990).
5. The number of years with positive temperature deviations.
6. The number of those years with positive temperature deviations taking place since 1970 (including 1970).
7. The five hottest years on record in the dataset.

In [190]:
# Write your code to answer Question 3 here
q3_1 = temp.loc[1900]
q3_2 = temp.tail(3)
q3_3 = temp.tail(3)*1.8
q3_4 = temp.loc[1990:].mean()
q3_5 = (temp > 0).sum()
q3_6 = (temp.loc[1970:] > 0).sum()
q3_7 = temp.nlargest(5)
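The Series operations used in Question 3 can be tried on a small made-up example first. In this sketch, `s` and its values are invented for illustration. Note that label slicing with `.loc` includes the starting label, and that summing a boolean Series counts its `True` entries.

```python
import pandas as pd

s = pd.Series([-0.2, 0.1, 0.4, -0.1, 0.6],
              index=[1988, 1989, 1990, 1991, 1992],
              name="temp_dev")

since_1990 = s.loc[1990:]    # label-based slice: 1990, 1991, 1992
n_positive = int((s > 0).sum())  # True counts as 1, so this counts positive years
top_two = s.nlargest(2)      # largest values first, index labels preserved

print(since_1990.index.tolist())  # [1990, 1991, 1992]
print(n_positive)                 # 3
print(top_two.index.tolist())     # [1992, 1990]
```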

### Question 4
The dataset `df` imported below as a Pandas dataframe shows statistics about nations and the happiness of citizens from the United Nations [World Happiness Report](https://worldhappiness.report).

In [195]:
# Run but do not modify this code
df = pd.read_csv("happiness.csv", index_col = "Country name")
df.head()

| Country name | Happiness | Log GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption |
|---|---|---|---|---|---|---|---|
| Finland | 7.8087 | 10.639267 | 0.95433 | 71.900825 | 0.949172 | -0.059482 | 0.195445 |
| Denmark | 7.6456 | 10.774001 | 0.955991 | 72.402504 | 0.951444 | 0.066202 | 0.168489 |
| Switzerland | 7.5599 | 10.979933 | 0.942847 | 74.102448 | 0.921337 | 0.105911 | 0.303728 |
| Iceland | 7.5045 | 10.772559 | 0.97467 | 73.0 | 0.948892 | 0.246944 | 0.71171 |
| Norway | 7.488 | 11.087804 | 0.952487 | 73.200783 | 0.95575 | 0.134533 | 0.263218 |


Without writing any loops, write code to answer the following questions where answers are in the variables `q4_1`, `q4_2`, and so on.

1. All of the data corresponding to the Country name `United States`. 
2. All of the data corresponding to the country at row index 50.
3. The average happiness score.
4. The number of countries with `Happiness` scores greater than 6.
5. The number of countries with `Log GDP per capita` less than 9.
6. The number of countries with `Happiness` scores greater than 6 and `Log GDP per capita` less than 9.

In [228]:
# Write your code to answer Question 4 here
q4_1 = df.loc["United States"]
q4_2 = df.iloc[50]
q4_3 = df["Happiness"].mean()
q4_4 = len(df[df["Happiness"]>6])
q4_5 = len(df[df["Log GDP per capita"]<9])
q4_6 = len(df[(df["Happiness"]>6) & (df["Log GDP per capita"]<9)])
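The row lookups and combined conditions in Question 4 can be illustrated on a toy DataFrame. Everything in this sketch (`toy`, the country labels, the numbers) is made up for illustration. The point to notice is that each comparison produces a boolean Series, and that conditions are combined with `&`; when the comparisons are written inline, each one needs its own parentheses because `&` binds more tightly than `>` and `<`.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Happiness": [7.5, 6.2, 5.1, 6.8],
     "Log GDP per capita": [10.9, 8.7, 8.1, 9.4]},
    index=pd.Index(["A", "B", "C", "D"], name="Country name"),
)

row_by_label = toy.loc["B"]    # lookup by index label
row_by_pos = toy.iloc[1]       # lookup by integer position (same row here)

happy = toy["Happiness"] > 6             # True for A, B, D
low_gdp = toy["Log GDP per capita"] < 9  # True for B, C
both = happy & low_gdp                   # True only for B

print(int(happy.sum()), int(low_gdp.sum()), int(both.sum()))  # 3 2 1
```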