<a href="https://colab.research.google.com/github/codeworkshopou/Data-Management-and-Statistics-in-Python/blob/main/Copy_of_DM2_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data management: Day 2

Throughout this tutorial, we will explore the conditions under which bigfoot is most often seen. The main objective is to keep practicing the `pandas` methods we learned in the previous session and to complement them with new techniques.

# Module import and data reading

In [None]:
import pandas as pd

The dataset was compiled by the [Bigfoot Field Researchers Organization](http://www.bfro.net) (BFRO) and is hosted on [data.world](https://data.world/). We will only use a subset of the complete dataset:

In [None]:
url_foot = "https://raw.githubusercontent.com/ulises-rosas/code/main/bigfoot.csv"
bigfoot = pd.read_csv(url_foot)
bigfoot.head(2)

Unnamed: 0,state,timestamp,latitude,longitude,temperature_high,temperature_mid,temperature_low,dew_point,humidity,cloud_cover,moon_phase,precip_intensity,precip_probability,pressure,uv_index,visibility,wind_bearing,wind_speed
0,Alabama,1981-09-15T12:00:00Z,32.31435,-85.16235,87.57,78.455,69.34,69.27,0.89,0.71,0.55,0.0364,1.0,1013.77,7.0,5.56,230.0,1.89
1,Alabama,1999-07-15T12:00:00Z,33.28375,-87.32655,86.54,78.44,70.34,70.47,0.8,0.38,0.1,,,1020.03,10.0,7.54,85.0,1.9


# Data exploration

Counting the number of bigfoot occurrences per state shows that Washington is the state where bigfoot was seen the most:

In [None]:
(bigfoot
    .groupby('state') # group by 'state' values
    .apply(len) # count rows per group
    .sort_values(ascending=False) # sort the counts in descending order
    .head(2) # take the first two rows
    )

state
Washington    538
Ohio          283
dtype: int64

Because we chained several methods one after another, we wrapped the whole expression in parentheses. This lets each method call sit on its own line, which improves readability.

## Question

> What is the year when bigfoot was seen the most?

*Hint*: i) you might need to create a new column from `'timestamp'` that isolates the year information; ii) to isolate the year, you can use the `apply` method with a function (i.e., one you have to define) that splits a string and returns its first item.
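To illustrate the mechanics of the hint without touching the real data, here is a minimal sketch on made-up timestamps. The data frame `toy` and the helper `first_field` are hypothetical names introduced only for this example, and the sketch assumes every timestamp is a non-missing string.

```python
import pandas as pd

# Made-up timestamps in the same format as the 'timestamp' column
toy = pd.DataFrame({
    "timestamp": [
        "2001-03-04T12:00:00Z",
        "1987-11-20T12:00:00Z",
        "2001-07-09T12:00:00Z",
    ]
})

def first_field(text, sep="-"):
    """Split `text` on `sep` and return the first piece."""
    return text.split(sep)[0]

# apply calls first_field once per value of the column
toy["year"] = toy["timestamp"].apply(first_field)
print(toy["year"].tolist())  # ['2001', '1987', '2001']
```

Note that the resulting years are strings, not integers, which is enough for grouping and counting.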

In [None]:
# your answer

# Correlation

We can obtain the linear correlation between the values of a given variable and the number of occurrences at each value. To do so, we group rows by that variable, count the rows in each group, and correlate the variable with its frequency. For example, to create a table of latitude vs. frequency, we can run the following lines:

In [None]:
freq_lat = bigfoot.groupby('latitude').apply(len)
freq_lat.sample(5)

latitude
27.97255    1
36.16550    1
39.42635    1
41.75840    1
38.83996    1
dtype: int64

However, the latitude is recorded so precisely that almost every value forms its own bin, with a frequency of 1. We can round the values of the `latitude` column with the `.round` method to make the bins wider:

In [None]:
bigfoot['latitude'] = bigfoot['latitude'].round(decimals = 1)
freq_lat = bigfoot.groupby('latitude').apply(len)
freq_lat.sample(5)


latitude
43.1    10
45.6    21
28.5    11
45.5    16
36.3    18
dtype: int64

Now, we can obtain the correlation between latitude and bigfoot occurrences by using the `corr` method:

In [None]:
mycorr = freq_lat.reset_index().corr()
mycorr

Unnamed: 0,latitude,0
latitude,1.0,0.515078
0,0.515078,1.0


Note that before calling `.corr` we needed to turn the row names (i.e., the latitude bins) into a column by using the `.reset_index` method. The column named `0` holds the number of bigfoot occurrences. Finally, to extract the correlation, we just need to specify its position in the table with the `.iloc` indexer:

In [None]:
mycorr.iloc[0,1]

0.5150776825659353
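The same number can also be reached without relying on positions in the correlation table. The following is a minimal sketch on made-up data, not the notebook's method: it uses `.size()` to count rows per group, names the counts column `count` (a name chosen here for illustration), and calls `Series.corr` on the two columns directly.

```python
import pandas as pd

# Made-up, already rounded latitudes; each row stands for one sighting
toy = pd.DataFrame({"latitude": [30.1, 30.1, 35.2, 40.3, 40.3, 40.3]})

counts = (toy
    .groupby("latitude")
    .size()                      # number of rows per latitude bin
    .reset_index(name="count")   # bins become a column, counts are named 'count'
)

# Pearson correlation between the bin value and its frequency
r = counts["latitude"].corr(counts["count"])
print(round(r, 3))  # 0.5
```

Naming the counts column makes the later lookup explicit, e.g. `counts.corr().loc["latitude", "count"]` returns the same value as `r`.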

## Challenge

From the list of variables below, find the three variables that correlate most strongly with bigfoot occurrence.

*Remark*: While the correlation can be positive or negative, the magnitude of the correlation is what we are looking for. A `for` loop that automates the procedure for getting the correlation might be a good starting point, using the following structure:
>```python
>out = []
>for c in cols:
>    # your solution
>    out.append([ c, corr ])
>```

It is convenient to store the results in a list because it can be easily converted into a data frame with `pd.DataFrame(out, columns=['var', 'corr'])`, after which we can apply the methods we have already worked with.
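As a sketch of that last step only, the snippet below converts a list of made-up pairs into a data frame and ranks them by magnitude. The variable names `a` to `d` and their correlation values are invented for illustration and are not results from the bigfoot data; the helper column `abs_corr` is likewise an assumption of this example.

```python
import pandas as pd

# Made-up [variable, correlation] pairs, NOT results from the bigfoot data
out = [["a", 0.20], ["b", -0.75], ["c", 0.40], ["d", -0.05]]

table = pd.DataFrame(out, columns=["var", "corr"])

# Rank by magnitude so that strong negative correlations are not overlooked
table["abs_corr"] = table["corr"].abs()
top = table.sort_values("abs_corr", ascending=False).head(3)
print(top["var"].tolist())  # ['b', 'c', 'a']
```

Keeping the signed `corr` column next to `abs_corr` preserves the direction of each relationship, which matters later when sorting rows.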

In [None]:
num_cols = [
 'latitude', 
 'longitude', 
 'temperature_high',
 'temperature_mid',
 'temperature_low',
 'dew_point',
 'humidity',
 'cloud_cover',
 'moon_phase',
 'precip_intensity',
 'precip_probability',
 'pressure',
 'uv_index',
 'visibility',
 'wind_bearing',
 'wind_speed'
 ]
# your solution here


# More on sorting

Up to this point, the examples showed how to sort rows by a single column. However, the `sort_values` method can also sort by several columns hierarchically: rows are sorted by the first column, and ties are broken by the next one, following the order in which the columns are given. For example, we pass a list of two columns (i.e., `'uv_index'` and `'humidity'`) and ask for the first to be sorted in decreasing order and the second in increasing order:

In [None]:
column_list = ['uv_index', 'humidity']

(bigfoot[ column_list ]
    .sort_values( column_list, ascending = [False, True] )
    .head()
)

Unnamed: 0,uv_index,humidity
2096,13.0,0.16
2097,13.0,0.47
299,13.0,0.74
140,12.0,0.08
3075,12.0,0.14


Note how the humidity values increase among the rows where `uv_index` is 13.

## Challenge

Sort the table by the top three variables we found in the previous section and, from the first 500 sorted rows, find the two states where bigfoot was most frequently seen. Compare this result with what we obtained before.

*Remark*: the sign of the linear correlation must be considered when choosing the sorting direction of each variable.

# Mean, Median, and Standard Deviation

The mean, median, and standard deviation can be computed directly with the `mean`, `median`, and `std` methods.

For example, to find the average moon phase per state when bigfoot was seen, we can call the `mean` method after grouping rows:

In [None]:
bigfoot.groupby('state')['moon_phase'].mean().head()

state
Alabama       0.43
Arizona       0.52
Arkansas      0.48
California    0.53
Colorado      0.44
Name: moon_phase, dtype: float64

Similarly, we can apply the same strategy for `median` and `std`.
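If all three statistics are needed at once, one option is the `agg` method, which accepts a list of method names. This is a minimal sketch on made-up data; the states `A` and `B` and their values are invented for illustration.

```python
import pandas as pd

# Made-up moon phases for two hypothetical states
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B"],
    "moon_phase": [0.1, 0.5, 0.9, 0.2, 0.4],
})

# One row per state, one column per statistic
summary = toy.groupby("state")["moon_phase"].agg(["mean", "median", "std"])
print(summary)
```

Each name in the list becomes a column of `summary`, so a single statistic can be retrieved with, for example, `summary.loc["A", "std"]`.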

## Questions

> What are the 'optimal' conditions for bigfoot based on the three variables we previously found?

*Hint*: We can use quantile information from the `describe` method to subset the dataset
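A minimal sketch of the mechanics behind this hint, on made-up data: it reads the 25% and 75% quantiles from `describe` and keeps the rows that fall between them. Treating the interquartile range as the 'optimal' range is only one possible reading, assumed here for illustration, and the values are invented.

```python
import pandas as pd

# Made-up values for a single variable
toy = pd.DataFrame({"uv_index": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]})

# describe returns a Series indexed by 'count', 'mean', 'std', 'min', '25%', ...
stats = toy["uv_index"].describe()
low, high = stats["25%"], stats["75%"]

# Keep rows whose value lies within [low, high], both ends included
inside = toy[toy["uv_index"].between(low, high)]
print(low, high, len(inside))  # 3.0 7.0 5
```

With several variables, the boolean conditions from `between` can be combined with `&` before subsetting.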

In [None]:
# your answer here

> What is the sparsity of values from these three variables within the 'optimal' condition range?

In [None]:
# your answer here