# Chapter III - Pandas (Data Analysis)

In this notebook we will cover how to use other built-in `pandas` methods for:
- statistical description
- data analysis
- plotting (hist/scatter/boxplot)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

Let's pick up where we left off: we reload the dataframe that we exported at the end of the previous notebook.

In [None]:
df = pd.read_csv('df_museum_tweets.csv')

## Section (1): The `.describe()` method

The `.describe()` method is a handy tool that provides summary statistics at a glance:

In [None]:
df.describe()

❓ Why are some columns missing in the summary table?

- To include all columns, you may use:

In [None]:
df.describe(include='all')
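To see the difference between the two calls without loading the full dataset, here is a minimal sketch on a toy dataframe. The column names and values are made up for illustration and are not taken from `df_museum_tweets.csv`.

```python
import pandas as pd

# Toy data: one numeric column and one text column (hypothetical names).
toy = pd.DataFrame({
    "tweet_length": [120, 95, 140, 60],
    "user": ["tate", "moma", "tate", "louvre"],
})

# Default call: compare the columns of this summary with those of `toy`.
numeric_summary = toy.describe()
print(numeric_summary.columns.tolist())

# include='all' adds rows such as 'unique', 'top' and 'freq',
# which only make sense for non-numeric columns.
full_summary = toy.describe(include="all")
print(full_summary.columns.tolist())
print(full_summary.index.tolist())
```

Cells of the summary that do not apply to a column (for example the mean of a text column) are filled with `NaN`.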

- ❓ How would you access the descriptive statistics of _one_ column?
- ❓ What is the datatype of such stats for _one_ column?

In [None]:
# your answer here:

## Section (2): `pandas` plotting

A properly structured `DataFrame`, with adequate datatypes, makes plotting really easy and fast.

There are two ways to tackle plotting the data in a dataframe column:

- (1) Extract the column and plot it using embedded methods.
    ```python
        df['ColumnToPlot'].plot()
    ```
- (2) Plot the dataframe using embedded methods, specifying in the arguments which columns you want to represent.
    ```python
        df.plot(y='ColumnToPlot')
    ```
These two possibilities are equivalent most of the time. (As your plot becomes more complex, you may prefer the second option, which lets you use other columns as well.)
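As a rough illustration of the two approaches, the sketch below draws the same column both ways, side by side. The toy dataframe and its column names are hypothetical, and the column is selected from the dataframe with the `y=` keyword of `DataFrame.plot`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data; the column names are hypothetical.
toy = pd.DataFrame({"n_words": [12, 18, 9, 22], "n_mentions": [0, 2, 1, 3]})

# Two explicit axes, so that each call draws on its own panel.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# (1) Extract the column (a Series) and plot it.
toy["n_words"].plot(ax=ax_left, title="Series.plot()")

# (2) Plot from the dataframe, selecting the column with `y=`.
toy.plot(y="n_words", ax=ax_right, title="DataFrame.plot(y=...)")
```

Both panels show the same line; the dataframe version additionally adds a legend entry with the column name.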

❗ Generally, unless all your columns are numerical values of comparable meaning, you will need to specify which kind of plot you want: 
- bar
- scatter
- hist
- box
- etc.
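The kind of plot can be chosen either with the `kind=` argument or with the matching accessor method (`.plot.bar()`, `.plot.box()`, and so on). A minimal sketch on made-up data, with hypothetical column names:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy data; names and values are made up for illustration.
toy = pd.DataFrame({"day": ["Mon", "Tue", "Wed"], "n_tweets": [5, 8, 3]})

fig, (ax_kind, ax_accessor) = plt.subplots(1, 2, figsize=(8, 3))

# Option a: pass the plot type through the `kind=` argument.
toy.plot(kind="bar", x="day", y="n_tweets", ax=ax_kind)

# Option b: use the accessor method named after the plot type.
toy.plot.bar(x="day", y="n_tweets", ax=ax_accessor)
```

The two calls are interchangeable; the accessor form has the advantage that tab completion lists the available plot types.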

<img src='../data//pandas_plot.png' width='1200px'>

### A) Basic plotting

As a first example, to plot a histogram of the number of words in the tweets, you can use:
- method (1)
    ``` python
    df['tweet_nbwords'].plot.hist()
    ```
- method (2)
    ``` python
    df.plot.hist(column='tweet_nbwords')
    ```

Try both methods! 

- ❓ Where in the method call would you pass the argument `bins=100` to set the number of bins?
- ❓ What if you wanted to do a scatter plot? For example comparing the number of words to the number of mentions? You might need the arguments `x=` and `y=`.

In [None]:
# Your answer here:

✏️ [Ex.1] Using different elements from past notebooks:
- Convert the "created_at" column to a datetime datatype (see `pd.to_datetime`)
- Produce a scatter plot of the tweet length vs. date
- Plot a histogram of the tweet length.
- Use the graph to investigate which tweets are the longest, and why.
- ❓ What about the 140-character limit? Is this applicable here? Why not?


In [None]:
# your answer here:

---

### B) Multi-variable plotting

As with the `.groupby()` logic, you might want to compare, in one go, how one variable evolves according to another one. This can easily be done in `pandas`.

For example, you can plot a histogram of the hour at which tweets are published with:
    
```python
    df.hist(column='hour')
```

In [None]:
df.hist(column='hour')

But if you modify the arguments of the method and specify the `by=` parameter:
    
```python
    df.hist(column='hour', by='day')
```

You plot as many histograms as there are values of `day`.

In [None]:
df.hist(column='hour', by='day')
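To check how `by=` splits the data, the sketch below uses a toy dataframe with only two distinct days. The values are made up; only the column names `hour` and `day` match the dataframe used above.

```python
import numpy as np
import pandas as pd

# Toy data with two distinct values of `day` (made-up values).
toy = pd.DataFrame({
    "hour": [9, 10, 10, 14, 15, 15, 16, 20],
    "day": ["Mon", "Mon", "Mon", "Mon", "Tue", "Tue", "Tue", "Tue"],
})

# One histogram panel is drawn per distinct value of `day`.
axes = toy.hist(column="hour", by="day")

# Flatten the returned axes and compare with the number of groups.
panels = np.ravel(axes)
print(len(panels), toy["day"].nunique())
```

With more groups, `pandas` arranges the panels on a grid, so some grid cells may be left empty.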

### C) Boxplots

For scenarios which require a finer statistical analysis of the data, boxplots are often a good choice as they represent:
- median
- 25th and 75th percentiles
- upper and lower whisker ends
- outliers
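The quantities listed above can also be computed by hand. The sketch below assumes the default whisker rule used by `matplotlib` (whiskers reach the furthest data point within 1.5 times the interquartile range of the box) and uses made-up values.

```python
import pandas as pd

# Made-up values with one obvious outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Box: 25th percentile, median, 75th percentile.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Assumed rule: points further than 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual point.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, whisker_low, whisker_high, outliers.tolist())
```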

To show the boxplot (or whisker plot) of one variable, you can do the following:
```python
    df[['Variable']].boxplot()
```

In [None]:
df[['tweet_length']].boxplot()

The same multi-variable logic applies here: you can get as many box plots of variable A as there are values of variable B.

For example, if we want to show the statistics of the tweet length by day of the week, we can do:
```python
    df[['tweet_length', 'day']].boxplot(by='day')
```

In [None]:
df[['tweet_length', 'day']].boxplot(by='day')
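To cross-check what a grouped boxplot displays, you can compare it with the per-group statistics computed through `.groupby()`. A minimal sketch on made-up data (only the column names match the dataframe used above):

```python
import pandas as pd

# Toy data with two days (made-up values).
toy = pd.DataFrame({
    "tweet_length": [80, 120, 95, 140, 60, 110, 130, 75],
    "day": ["Mon", "Mon", "Mon", "Mon", "Tue", "Tue", "Tue", "Tue"],
})

# One box per distinct value of `day`.
ax = toy[["tweet_length", "day"]].boxplot(by="day")

# The line inside each box should match the per-group median.
group_medians = toy.groupby("day")["tweet_length"].median()
print(group_medians)
```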