# Lab-Data-Manipulation (PART-1)

In [None]:
import pandas as pd
import numpy as np

# Context 

For this lab you'll use a dataset of UFO observations. The objective is to practice manipulating a dataframe, so we'll use the tools we've learned for `reading`, `renaming`, `selecting specific columns`, `filtering based on conditions` and `merging` dataframes to better understand our dataset and store an enriched version of it at the end.

Good Luck.

| variable | class | description |
|----------|-------|-------------|
| date_time | datetime (mdy h:m) | Date and time the sighting occurred |
| city_area | character | City or area of sighting |
| state | character | State/region of sighting |
| country | character | Country of sighting |
| ufo_shape | character | UFO shape |
| encounter_length | double | Encounter length in seconds |
| described_encounter_length | character | Encounter length as described (e.g. 1 hour, etc.) |
| description | character | Description of encounter |
| date_documented | character | Date documented |
| latitude | double | Latitude |
| longitude | double | Longitude |

## Read the dataset and store it in a dataframe called `ufo`

Pay attention to the file separator.

In [None]:
ufo = pd.read_csv('ufo.csv', sep=';')
ufo.head(2)
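
To see why the separator matters, here is a minimal sketch that reads the same text twice, once with the default comma and once with `sep=';'`. The two-row sample below is made up for illustration and is not taken from `ufo.csv`.

```python
import io
import pandas as pd

# Hypothetical sample text (not real rows from ufo.csv)
raw = "city_area;state;country\nsan marcos;tx;us\nchester;;gb\n"

# With the default separator (a comma) every line collapses into one column
wrong = pd.read_csv(io.StringIO(raw))

# With the matching separator the three columns are split correctly
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)
print(right.shape)
```

If a freshly loaded dataframe has a single, very long column name, a wrong separator is a likely cause.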

## Check the first 6 columns of the dataframe

In [None]:
ufo.columns[:6]

## Check the shape of your dataframe to see how many rows and columns it has

In [None]:
ufo.shape

## Bring the date information to the beginning of the dataframe

If you check the dataframe columns, you'll see that some of the date information sits at the end of the dataframe. For this task, reorder the columns so that the first few columns all show the date information.

*Hint: Use `ufo.columns` to see all the column names you have.*

In [None]:
ufo = ufo[[
    'date',
    'year', 'month', 'day', 'date_time',
    'date_documented',
    'Unnamed: 0',
    'city_area', 'state', 'country',
    'ufo_shape', 'encounter_length',
    'described_encounter_length', 'description', 
    'latitude', 'longitude'
]]

## Just check if you did it the right way. Take a look at the head of the dataframe again and see if the `ufo` dataframe now is reordered.

In [None]:
ufo.head(2)

## Select a piece of your dataframe. Create a new dataframe called `ufo_vars` and select only the following columns of the `ufo` dataframe. 

`year`, `month`, `state`, `country`, `ufo_shape`, `encounter_length`

In [None]:
ufo_vars = ufo[['year', 'month', 'state', 'country', 'ufo_shape', 'encounter_length']]

Perform a *.head()* on your result to check if you did it right.

In [None]:
ufo_vars.head(2)

## Rename the variable `encounter_length` to `encounter_seconds`. Keep using the `ufo_vars` dataset for the following tasks, unless otherwise specified.

Again, inspect your results to confirm you did it right.

In [None]:
ufo_vars = ufo_vars.rename(columns = {'encounter_length': 'encounter_seconds'})

## Let's start filtering some records. Create a new dataframe called `ufo_us` and filter the `ufo_vars` dataframe bringing only the results in which the `country` is `"us"`

### Use a mask to perform this task 

A `mask` is nothing more than a condition. This condition is applied to your whole dataframe (or pandas Series).
So, for example, if you had a dataframe `df` with a column called `Age`, you could create a mask for all people whose `Age` is less than 18 using the syntax:

`df['Age'] < 18`

This would return a pandas Series containing `True` and `False` values, one for each index.

You could save this mask in a variable, for example:

`condition = (df['Age'] < 18)`

And then you could use that variable `condition` to select only the rows of the dataframe for which the mask is `True`, using:
`df[condition]`.
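
As a small illustration of the idea, the sketch below builds a toy dataframe (the names and ages are invented) and applies a mask to it.

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({'Name': ['Ana', 'Bo', 'Cy'], 'Age': [15, 18, 42]})

# The mask is a boolean Series aligned with the dataframe's index
condition = df['Age'] < 18

# Indexing with the mask keeps only the rows where it is True
minors = df[condition]

print(condition)
print(minors)
```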

In [None]:
ufo_us = ufo_vars.loc[ufo_vars['country'] == 'us']
ufo_us

### Use the .query() method to perform the same task

Remember that the .query() method expects a string. That string should contain the variable of your dataframe without quotation marks and the comparison. For example, if you had a variable called `name`, you'd use a syntax like:
 `df.query('name == "Jack"')`
 
to bring all observations whose column `name` is exactly equal to `"Jack"` (note that Jack should be within quotation marks because a name is a string in this example).
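
A minimal sketch of `.query()` on invented data follows. It also shows the `@` prefix, which lets the query string refer to a Python variable; the variable name `wanted` is just an example.

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({'name': ['Jack', 'Jill', 'Jack'], 'score': [7, 9, 4]})

# Column names go unquoted, string values go inside quotation marks
jacks = df.query('name == "Jack"')

# A local Python variable can be referenced with the @ prefix
wanted = 'Jill'
jills = df.query('name == @wanted')

print(jacks)
print(jills)
```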

In [None]:
ufo_us = ufo_vars.query('country == "us"')
ufo_us

See which one you prefer and keep using it for the exercises that follow.

## For the `ufo_us` dataframe, select only the cases in which the year is in the first decade of the 2000s (2001-2010). Put that in a variable called `ufo_us_2000`.

Check your results.

In [None]:
# Tried:
# ufo_us['year'] in range(2001, 2011)

# Tried:
# first_decade = list(range(2001, 2011))
# ufo_us['year'] in first_decade

ufo_us_2000 = ufo_us.query('2001 <= year <= 2010')

ufo_us_2000 = \
    ufo_us[
        (ufo_us['year'] >= 2001) & \
        (ufo_us['year'] <= 2010)
    ]
    
ufo_us_2000
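
The commented-out attempts with Python's `in` do not work as a filter: `in` asks for a single `True`/`False` answer about the whole Series, and pandas will typically raise a `ValueError` about an ambiguous truth value instead of producing a mask. Two element-wise alternatives are `Series.between` and `Series.isin`, sketched here on invented years.

```python
import pandas as pd

# Toy data, invented for illustration
years = pd.DataFrame({'year': [1999, 2001, 2005, 2010, 2011]})

# between() is inclusive on both ends by default
by_between = years[years['year'].between(2001, 2010)]

# isin() checks each element against a collection of allowed values
by_isin = years[years['year'].isin(range(2001, 2011))]

print(by_between)
print(by_isin)
```

Both return the same rows as the pair of `>=` / `<=` comparisons used above.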

## Try to do the same without the intermediate step of creating the `ufo_us` dataframe. That is, try to filter the dataset for the cases in which the country is "us" and the year is (2001-2010)



*Hint:* You have to make sure all of these conditions are applied simultaneously - using the `and` (or `&`) operator. Try to understand when to use the `and` and the `&` operator.

In [None]:
ufo_vars[
    (ufo_vars['country'] == 'us') & \
    (ufo_vars['year'] >= 2001) & \
    (ufo_vars['year'] <= 2010)
]
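
On the `and` versus `&` question from the hint: `&` combines two boolean Series element by element, while `and` tries to reduce each whole Series to one boolean and fails. The sketch below shows both on invented data; the flag `and_worked` exists only for this demonstration.

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({'country': ['us', 'us', 'ca'],
                   'year': [2005, 1995, 2005]})

# & works element-wise; the parentheses are needed because & binds
# more tightly than comparison operators such as == and >=
mask = (df['country'] == 'us') & (df['year'] >= 2001)
selected = df[mask]

# `and` needs a single True/False from each side, which a Series cannot give
try:
    (df['country'] == 'us') and (df['year'] >= 2001)
    and_worked = True
except ValueError:
    and_worked = False

print(selected)
print(and_worked)
```

Inside a `.query()` string the situation is different: there, `and` and `&` can both be used to combine conditions.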

## BONUS 1: Compare the number of triangular UFO occurrences (checking the `ufo_shape` variable) in the US from 2001 to 2010 with the number from 1991 to 2000.

*Hint: you should expect roughly 3 times more cases for 2001-2010 than 1991-2000.*

In [None]:
cond_ufo_us_triangular = \
    (ufo['country'] == 'us') & \
    (ufo['ufo_shape'] == 'triangle')

cond_ufo_us_triangular_2001_2010 = \
    cond_ufo_us_triangular & \
    (ufo['year'] >= 2001) & \
    (ufo['year'] <= 2010)

cond_ufo_us_triangular_1991_2000 = \
    cond_ufo_us_triangular & \
    (ufo['year'] >= 1991) & \
    (ufo['year'] <= 2000)

In [None]:
ufo[cond_ufo_us_triangular_2001_2010].shape

In [None]:
ufo[cond_ufo_us_triangular_1991_2000].shape

How many rows does each dataset have?

## BONUS 1.1: How many values does each category of `ufo_shape` have?

*Hint: Remember last class*

In [None]:
ufo['ufo_shape'].value_counts()

# Lab-Data-Manipulation (PART-2)

The second part of this lab consists of grouping and merging results.

# Grouping up the results. 

## Let's calculate the average encounter length for each country.

We should now group the results by the `country` column to see the mean `encounter_seconds` for each country. Do this using the `groupby` method of your dataframe `ufo_vars`. What is the average encounter length for the US? And for Canada?

Remember that after grouping by a column, you have to specify an `aggregating function`. If you don't, the result of the groupby will only be a pandas `groupby` object. For this case, we want the aggregating function to be the `mean` function, and then the results will appear.

Also remember that **if you don't** specify the `as_index=False` argument, the variables you use to group are going to become your new indexes.
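
A minimal sketch of both points, on invented data: the bare `groupby` returns a lazy object, and `as_index` decides whether the grouping key ends up in the index or stays a regular column.

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({'country': ['us', 'us', 'ca'],
                   'encounter_seconds': [10.0, 30.0, 60.0]})

# Without an aggregating function this is only a GroupBy object
grouped = df.groupby('country')

# Default: the grouping key becomes the index of the result
key_as_index = grouped.mean()

# as_index=False: the grouping key stays a regular column
key_as_column = df.groupby('country', as_index=False).mean()

print(type(grouped).__name__)
print(key_as_index)
print(key_as_column)
```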

In [None]:
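# numeric_only=True skips text columns such as `state`, which recent pandas versions raise an error on
ufo_vars.groupby(by = 'country', as_index = False).mean(numeric_only = True)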

## Perform the same task, but instead of calculating the mean, count how many occurrences there are for each country.

For this case, the aggregating function should be the `count` function. Try to understand the results for each column.

In [None]:
ufo_vars.groupby(by = 'country', as_index = False).count().head(2)

In [None]:
# Some rows have empty values, which explains the difference in the counts
ufo_vars[ufo_vars['ufo_shape'].isnull()].head(2)
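
The effect of missing values on `count` can be reproduced on a tiny invented dataframe: `count` reports non-null values per column, while `size` reports the number of rows per group regardless of nulls.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; two shapes are missing
df = pd.DataFrame({'country': ['us', 'us', 'us', 'ca'],
                   'ufo_shape': ['disk', np.nan, 'light', np.nan]})

# count() skips missing values, column by column
counts = df.groupby('country').count()

# size() counts rows in each group, missing values included
sizes = df.groupby('country').size()

print(counts)
print(sizes)
```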

## Perform the same task, but instead of calculating the mean, use the `.describe()` aggregating function to see the effects.

The describe aggregating function will show you several important statistics for the grouped results, such as `mean`, `median`, `standard deviation`, `count`, `max`, `min`, and so on.

*Hint: If it starts to get difficult to see the results, you can transpose the resulting dataframe by just putting a `.T` at the end.*

In [None]:
# Group by country, then summarize the numeric columns of each group
ufo_vars.groupby(by = 'country').describe()
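
As a sketch of what a grouped `describe` looks like, and of the `.T` trick from the hint, here is a version on invented data restricted to a single column.

```python
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({'country': ['us', 'us', 'ca', 'ca'],
                   'encounter_seconds': [10.0, 30.0, 60.0, 120.0]})

# One row per country, one column per statistic
stats = df.groupby('country')['encounter_seconds'].describe()

# Transposed: one row per statistic, one column per country
stats_t = stats.T

print(stats)
print(stats_t)
```

The median appears in this output under the `50%` label.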

## Now, let's go deeper into the analysis and group the results not only by `country`, but by `country` and `year`

### Check the values of the mean and count for the `encounter_seconds` variable for each year. Can you see some discrepancy?

*Hint*: If you want, you can use the `ufo_us` dataset just to see the results for the United States. You could also (in a hacky way) perform the filter right before the groupby operation.

In [None]:
pd.__version__

In [None]:
ufo_agg = ufo_vars[['country', 'year', 'encounter_seconds']] \
    .groupby(by = ['country', 'year']) \
    .agg(
        mean = ('encounter_seconds', 'mean'),
        count = ('encounter_seconds', 'count')
    ) \
    .reset_index()
ufo_agg_us = ufo_agg[ufo_agg.country == 'us']
ufo_agg_us

In [None]:
ufo_agg_us.head(10)

In [None]:
ufo_agg_us[30:40]

In [None]:
ufo_agg_us.tail(10)

# BONUS 2: Which months are the ones with the highest numbers of occurrences?

Use the groupby function to `count` the number of occurrences using the `month` variable as the `key` for the groupby. In which months do UFOs appear the most?

*hint: The best way to visualize is to select the key and a single variable, for example ['month','year'] to check the results*

In [None]:
ufo[['month', 'year']] \
    .groupby(by = ['month'], as_index = False) \
    .count() \
    .sort_values(by = 'year', ascending = False)

# BONUS 3: Finally, you gathered information about the UFO dataset. Using your last result, try to bring that information into your original dataset.

1. Store the results of your previous analysis (the mean value of `encounter_seconds` for each year and each country) in a dataframe called `avg_results`. Remember, in this case, to pass the argument `as_index=False` to your groupby method to keep the `keys` as columns.

In [69]:
avg_results = \
    ufo_vars[['country', 'year', 'encounter_seconds']] \
        .groupby(by = ['country', 'year']) \
        .agg(
            mean = ('encounter_seconds', 'mean'),
            count = ('encounter_seconds', 'count')
        ) \
        .reset_index()

In [74]:
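# Experiment: group with as_index=False under a separate name,
# so the `avg_results` table built above is not overwritten
grouped_no_index = \
    ufo_vars[['country', 'year', 'encounter_seconds']] \
        .groupby(by = ['country', 'year'], as_index = False)

In [75]:
grouped_no_index.agg(mean = ('encounter_seconds', 'mean'))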

|     | mean |
|-----|------|
| 0 | au |
| 1 | au |
| 2 | au |
| 3 | au |
| 4 | au |
| ... | ... |
| 277 | us |
| 278 | us |
| 279 | us |
| 280 | us |


In [71]:
a = ufo_vars[['country', 'year', 'encounter_seconds']] \
        .groupby(by = ['country', 'year'])

In [None]:
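# Named aggregation: output column name = (input column, function)
a.agg(mean = ('encounter_seconds', 'mean'))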

In [70]:
avg_results

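|     | country | year | mean | count |
|-----|---------|------|------|-------|
| 0 | au | 1958 | 2700.000000 | 1 |
| 1 | au | 1960 | 180.000000 | 1 |
| 2 | au | 1967 | 300.000000 | 1 |
| 3 | au | 1968 | 300.000000 | 1 |
| 4 | au | 1972 | 403.333333 | 3 |
| ... | ... | ... | ... | ... |
| 277 | us | 2010 | 2271.987232 | 3548 |
| 278 | us | 2011 | 2544.292555 | 4379 |
| 279 | us | 2012 | 10640.644916 | 6320 |
| 280 | us | 2013 | 1266.387888 | 6056 |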


2. Rename the column named `encounter_seconds` to `avg_encounter_seconds`.

3. Use the pd.merge( ... ) function to bring that new collected information to your original dataset.
The pd.merge() function requires several arguments, let's understand the most important ones.

`left` is the dataframe you want to bring information **to** - the table on the left. In this case, this will be our original dataframe called `ufo`

`right` is the dataframe you want to bring information **from** - the table on the right. In this case, this will be our resulting dataframe `avg_results`.

`on` is the key (or list of keys) on which you want to perform the merge. That is, if those values are **exactly equal** in both dataframes, then the information will be brought over.

Put your results in a dataframe called `merged_ufo`.

Check how many rows the final result has and try to explain it. Did the dataset get smaller? Bigger? Or the same? Can you explain why? 

*Hint: This has to do with the `how` argument you used (or rather its default value) in the `pd.merge()` function.*
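
To build intuition for `left`, `right`, `on` and `how` before working on the real data, here is a minimal sketch on two invented dataframes. The frames and their values are assumptions for illustration, not results from the UFO dataset.

```python
import pandas as pd

# Toy "original" table: one row has no match on the right
left = pd.DataFrame({'country': ['us', 'us', 'ca'],
                     'year': [2001, 2002, 2001]})

# Toy "summary" table keyed by country and year
right = pd.DataFrame({'country': ['us', 'us'],
                      'year': [2001, 2002],
                      'avg_encounter_seconds': [100.0, 200.0]})

# Default how='inner': only key combinations present in both tables survive
inner = pd.merge(left, right, on=['country', 'year'])

# how='left': every row of the left table is kept, unmatched rows get NaN
left_join = pd.merge(left, right, on=['country', 'year'], how='left')

print(len(inner), len(left_join))
print(left_join)
```

Comparing `len(left)` with the length of each result is a quick way to check whether a merge dropped rows.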

## Store the results into a new csv file called `ufo_enriched.csv`. 

Don't forget to use `index=False`.
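
A short sketch of what `index=False` changes, using an invented dataframe and writing to a string instead of a file. When the index is written out and the text is read back, pandas names the extra column `Unnamed: 0`, which is a plausible origin of the `Unnamed: 0` column seen in `ufo` earlier in this lab.

```python
import io
import pandas as pd

# Toy data, invented for illustration
df = pd.DataFrame({'country': ['us', 'ca'], 'year': [2001, 2002]})

# With no path given, to_csv returns the CSV text as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Reading the first version back produces an extra, unnamed column
reloaded = pd.read_csv(io.StringIO(with_index))

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
print(list(reloaded.columns))
```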