---

# 3. Pandas

Pandas is the best-known Python library for manipulating and analyzing data. It is built on top of NumPy, so many of its features will feel familiar. We will use Pandas to work with structured datasets.

Just as NumPy gives us arrays, and with them many new features, Pandas gives us DataFrames and Series. The DataFrame is by far the more widely used of the two.
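
As a quick illustration of the two objects, here is a minimal sketch built on toy values (the names and amounts below are made up for the example and do not come from the dataset):

```python
import pandas as pd

# A Series is a one-dimensional labeled array (toy values)
amounts = pd.Series([120, 85, 40], name="amount")

# A DataFrame is a two-dimensional table; each of its columns is a Series
toy_df = pd.DataFrame(
    {"name": ["Ana", "Luis", "Sofia"], "amount": [120, 85, 40]}
)

print(type(toy_df["amount"]))  # selecting a single column returns a Series
print(toy_df.shape)  # (number of rows, number of columns)
```

Selecting a column with `toy_df["amount"]` is the most common way to move from a DataFrame to a Series.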

We are going to use the open data of the Argentine government, so you will have to download the csv from the following link: [Names 2010-2014](https://www.datos.gob.ar/dataset/otros-nombres-personas-fisicas)

In [96]:
import pandas as pd

## Reading a csv file

In [97]:
df_names = pd.read_csv(
    "https://infra.datos.gob.ar/catalog/otros/dataset/2/distribution/2.20/download/nombres-2010-2014.csv"
)
df_names

In [None]:
df_names.dtypes

## Renaming columns

First of all, let's rename the columns to `name`, `amount`, and `year`.

In [None]:
df_names.rename(
    columns={"nombre": "name", "cantidad": "amount", "anio": "year"}, inplace=True
)
df_names

## Some useful Pandas functions

**TODO:** Investigate the functions that are implemented in the next cell. What do they do? What do you think they can be useful for?

In [None]:
# df_names.head()
# df_names.tail()
# df_names.count()
df_names.shape
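
If it helps to compare notes, the sketch below runs the same calls on a small made-up DataFrame (the `toy_df` name and its values are assumptions for the example, not part of the dataset):

```python
import pandas as pd

# Toy data with one missing name, to make count() more interesting
toy_df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Sofia", "Juan", None],
        "amount": [120, 85, 40, 300, 15],
    }
)

first_rows = toy_df.head(2)  # first 2 rows (5 by default)
last_rows = toy_df.tail(2)  # last 2 rows (5 by default)
non_null = toy_df.count()  # number of non-null values per column
n_rows, n_cols = toy_df.shape  # an attribute, so no parentheses

print(first_rows)
print(last_rows)
print(non_null)
print(n_rows, n_cols)
```

Note that `count()` skips missing values, while `shape` reports every row.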

## Append a new row

**TODO:** Suppose that when the data was loaded, someone forgot to add a name together with its amount and year.

Let's add to our dataset the following row with said information:

Name: "Daenerys Stormborn of the House Targaryen, First of Her Name, the Unburnt, Queen of the Andals and the First Men, Khaleesi of the Great Grass Sea, Breaker of Chains, and Mother of Dragons"

Amount: 100
Year: 2011

In [None]:
new_row = pd.DataFrame(
    {
        "name": [
            "Daenerys Stormborn of the House Targaryen, First of Her Name, the Unburnt, Queen of the Andals and the First Men, Khaleesi of the Great Grass Sea, Breaker of Chains, and Mother of Dragons"
        ],
        "amount": [100],
        "year": [2011],
    }
)

# Concatenate the new DataFrame with the original DataFrame
df_names = pd.concat([df_names, new_row], ignore_index=True)

df_names

In [None]:
# Complete this cell with your code

# total amount for each (name, year) pair
counts = df_names.groupby(["name", "year"]).sum()
# filter for name-year pairs with a total amount greater than 2000
filtered_counts = counts[counts["amount"] > 2000]
# reset the index to make the results easier to work with
filtered_counts = filtered_counts.reset_index()
# print the results
filtered_counts

**TODO:** Investigate the `columns` and `index` attributes. What do they hold? What data type do they return? What familiar data type do they resemble?

In [None]:
df_names.columns

In [None]:
df_names.index
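
For comparison, a small sketch on a made-up DataFrame (toy values, not the dataset) shows what the two attributes look like and how to turn them into plain Python lists:

```python
import pandas as pd

toy_df = pd.DataFrame({"name": ["Ana", "Luis"], "amount": [120, 85]})

print(type(toy_df.columns))  # an Index holding the column labels
print(type(toy_df.index))  # a RangeIndex when no row labels are given

# Both can be converted to ordinary lists
column_list = list(toy_df.columns)
row_labels = toy_df.index.tolist()

print(column_list)
print(row_labels)
```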

## Add a new column

**TODO:** Add a column to the dataframe that holds the number of characters in each name.

In [None]:
# Complete this cell with your code

# .str.len() is vectorized and returns NaN for missing names instead of raising
df_names["name_length"] = df_names["name"].str.len()

## Filtering by mask

Filtering by mask works very similarly in NumPy and Pandas, so we will see how to do it first in NumPy and then in Pandas.

Suppose we make 100 rolls of a die, but we want to select only those rolls that were greater than three. How can we do it?

In [None]:
import numpy as np

dice = np.random.randint(1, 7, size=100)
print(dice)

What we can do is create a mask:

In [None]:
mask = dice > 3
print(mask)
print(type(mask))

In [None]:
print(dice[mask])

In [None]:
print(dice.sum())

In [None]:
print(dice[dice > 3])
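
The same idea carries over to Pandas columns. The following is a minimal sketch on made-up data (the `toy_df` values are assumptions for the example): comparing a column with a value produces a boolean Series, and several conditions can be combined with `&` or `|`.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "name": ["Ana", "Maximiliano", "Sofia", "Valentina"],
        "amount": [120, 2500, 40, 3100],
        "year": [2010, 2011, 2012, 2013],
    }
)

# Comparing a column with a value returns a boolean Series (the mask)
mask = toy_df["amount"] > 2000
print(toy_df[mask])

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
combined = (toy_df["amount"] > 2000) & (toy_df["year"] >= 2012)
print(toy_df[combined])
```

The parentheses matter because `&` binds more tightly than the comparison operators in Python.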

**TODO:** Going back to our dataset, suppose we want to keep those rows with names that were repeated more than 2000 times in the corresponding year. Note that in the result a name may appear more than once in different years.

In [None]:
# Complete this cell with your code

# `amount` already holds how many times each name was registered in a year,
# so build a boolean mask directly on that column
mask = df_names["amount"] > 2000

# Apply the mask to keep only the rows that satisfy the condition
filtered_df = df_names[mask]

# Display the filtered DataFrame
filtered_df

**TODO:** What if we want to select those names with more than 8 characters and from 2010 onwards?

In [None]:
# Complete this cell with your code

## Statistics

**TODO:** Obtain the mean value and standard deviation of each numeric column. Is there a function in Pandas that will give us even more statistics?

In [None]:
# Complete this cell with your code
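
As a hint, here is one possible sketch on toy data (made-up values, not the dataset). It assumes a Pandas version in which `mean` and `std` accept `numeric_only`:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Sofia"],
        "amount": [10, 20, 30],
        "year": [2010, 2011, 2012],
    }
)

# numeric_only=True skips text columns such as `name`
means = toy_df.mean(numeric_only=True)
stds = toy_df.std(numeric_only=True)  # sample standard deviation (ddof=1)

# describe() reports count, mean, std, min, quartiles, and max in one call
summary = toy_df.describe()

print(means)
print(stds)
print(summary)
```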

## Delete a column

**TODO:** Delete the column `name_length` from the dataframe.

In [None]:
# Complete this cell with your code
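
A minimal sketch of dropping a column, shown on toy data with a hypothetical column named `extra` (both the data and the column name are assumptions for the example):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"name": ["Ana", "Luis"], "amount": [120, 85], "extra": [3, 4]}
)

# drop() returns a new DataFrame; the original is left unchanged
trimmed = toy_df.drop(columns=["extra"])

print(list(trimmed.columns))
print(list(toy_df.columns))
```

Assigning the result back to the same variable, as in `toy_df = toy_df.drop(columns=["extra"])`, makes the change stick.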

## Sorting by column

**TODO:** Sort the dataframe by `amount` in descending order.

In [None]:
# Complete this cell with your code
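
For reference, a small sketch of sorting on toy data (made-up values, not the dataset):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"name": ["Ana", "Luis", "Sofia"], "amount": [120, 300, 40]}
)

# ascending=False puts the largest values first;
# the original row labels travel with their rows
sorted_df = toy_df.sort_values(by="amount", ascending=False)

print(sorted_df)
```

Like most Pandas operations, `sort_values` returns a new DataFrame rather than modifying the original.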

## Pandas groupby and plot

**TODO:** Group the number of names by `year` and plot the result as a vertical bar chart.

In [None]:
# Complete this cell with your code
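
One way to approach this, sketched on toy data (made-up values): group by `year`, sum `amount` within each group, and plot the resulting Series. The sketch assumes "number of names" means the total `amount` per year, and the plotting step assumes matplotlib is installed.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "name": ["Ana", "Luis", "Ana", "Sofia"],
        "amount": [120, 85, 95, 40],
        "year": [2010, 2010, 2011, 2011],
    }
)

# Split the rows by year, then sum `amount` within each group
per_year = toy_df.groupby("year")["amount"].sum()
print(per_year)

# Series.plot.bar() draws vertical bars; Pandas raises ImportError
# if matplotlib is not available
try:
    ax = per_year.plot.bar()
    ax.set_ylabel("amount")
except ImportError:
    print("matplotlib is not installed; skipping the plot")
```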