# pandas

## <br>Basic plotting

Pandas allows you to make some basic plots without loading other packages. Plotting in pandas is good for exploring your data, and we'll focus on that today.
<br><br>Pandas is not the best tool for polished, publication-quality data visualizations. The Python library for that is called matplotlib. We will not be covering how to make data visualizations for publication, since this is not a matplotlib workshop.
<br><br>Pandas' plotting capabilities are actually built on matplotlib, but with a much simpler interface.

In [None]:
import pandas as pd

We're going to work with our forest fire dataset again.

#### If you are using Google Colab, you must run the next line of code. *If you are NOT using Google Colab, do NOT run the next line.*

In [None]:
!wget https://raw.githubusercontent.com/aGitHasNoName/pandasBasics/main/forestfires.csv

<br><br>Everyone can now run the next line of code to create a DataFrame from our csv file.

In [None]:
df = pd.read_csv("forestfires.csv")

In [None]:
df.head()

### <br><br>Histograms

One of the most common data exploration tasks you might do is to check the distributions of the columns in your dataset.

We will use the `hist()` method on the `temp` column.

In [None]:
df["temp"].hist()

<br><br>Like I said, it's not pretty, but it tells us the story of our data.

By default, `hist()` will divide the data into 10 bins. We can change that by passing a keyword argument:

In [None]:
df["temp"].hist(bins=20)
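To see what `bins` is doing behind the picture, we can count how many values land in each equal-width bin. This is only a sketch on a small made-up Series (`demo_temp` is not from the forest fire data), and `value_counts(bins=...)` may treat values that sit exactly on a bin edge slightly differently from `hist()`.

```python
import pandas as pd

# Made-up temperatures, used only for illustration
demo_temp = pd.Series([4.6, 8.2, 11.0, 14.3, 15.9, 18.0, 19.4, 21.7, 22.3, 27.5])

# Count the values that fall into each of 4 equal-width bins
bin_counts = demo_temp.value_counts(bins=4, sort=False)
print(bin_counts)
```

Each row of the output is one bin (shown as an interval) and the number of values inside it, which is what the height of each histogram bar represents.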

### <br><br>Exercise 1

Make a histogram of the `humidity` column. Specify that you want the data grouped into 15 bins.

<br><br><br>Another way to create a histogram in pandas is to call `hist()` on the DataFrame itself:

In [None]:
df.hist(column="temp")

<br>Both ways do the same thing. In the first, we select a single column from our DataFrame (a Series) and make a histogram of it. In the second, we call `hist()` on the entire DataFrame and then specify the column with an argument.
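A quick way to see the difference is to check what each expression hands back. The sketch below uses a tiny made-up DataFrame (`demo` and its values are assumptions, not the workshop data): selecting one column gives a Series, while `DataFrame.hist()` returns an array of plot axes, one per histogram.

```python
import pandas as pd

# Made-up data, used only for illustration
demo = pd.DataFrame({"temp": [8.2, 14.3, 18.0, 21.7, 22.3],
                     "humidity": [51, 33, 40, 27, 44]})

# Selecting a single column gives a Series, which has its own hist() method
temp_col = demo["temp"]
print(type(temp_col))

# Calling hist() on the whole DataFrame returns an array of axes,
# with one entry for each column that was plotted
axes = demo.hist(column="temp")
print(axes.size)
```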

<br>If we don't specify a column, we will get histograms of all columns with numerical data:

In [None]:
df.hist()
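If you want to know ahead of time which columns will show up, you can ask pandas for the numeric columns. A minimal sketch on made-up data (the `demo` frame is an assumption for illustration):

```python
import pandas as pd

# Made-up data with one text column and two numeric columns
demo = pd.DataFrame({"month": ["mar", "aug", "aug", "sep"],
                     "temp": [8.2, 22.3, 18.0, 21.7],
                     "humidity": [51, 27, 40, 44]})

# Text columns such as "month" are not numeric, so they are left out
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['temp', 'humidity']
```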

<br>We can also ask for a list of columns:

In [None]:
df.hist(column=["temp", "humidity"])

### <br><br>Exercise 2

In one line of code, create histograms for the `moisture_code` and `drought_code` columns.

**<br><br><br>What other changes can we make to our histogram?**

Let's look at the documentation. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html

<br>We can get rid of the grid lines!

In [None]:
df.hist(column="temp", grid=False)

<br>We can also change the figure size. The `figsize` keyword argument takes two numbers, width in inches and then height in inches, usually written as a tuple such as `(8, 4)`. A list such as `[8, 4]` works too.
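For example, here is a sketch that asks for a figure 8 inches wide and 4 inches tall. The data is made up for illustration, and the sizes are just one choice to try:

```python
import pandas as pd

# Made-up temperatures, used only for illustration
demo = pd.DataFrame({"temp": [4.6, 8.2, 11.0, 14.3, 15.9,
                              18.0, 19.4, 21.7, 22.3, 27.5]})

# figsize is (width, height), both in inches
axes = demo.hist(column="temp", grid=False, figsize=(8, 4))

# The figure that holds the histogram reports the size we asked for
fig = axes.flat[0].figure
print(fig.get_size_inches())
```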

### <br><br>Exercise 3

Remember this plot?

In [None]:
df.hist()

<br>Change the figure size in the next line of code until you can see the plots better:

In [None]:
df.hist(grid=False, figsize=[2,2])

## <br><br><br>Scatter plots to check for correlation

We can make a quick scatter plot to check for correlation between two columns. The format is slightly different from `hist()`: we call `df.plot.scatter()`. This method requires two arguments: the names of the columns for the x and y axes.

In [None]:
df.plot.scatter(x="temp", y="humidity")
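A scatter plot shows the shape of a relationship. If you also want a number to go with it, the `corr()` method gives the Pearson correlation coefficient between two columns. This sketch uses made-up values, so the result says nothing about the real forest fire data:

```python
import pandas as pd

# Made-up data where humidity falls as temperature rises
demo = pd.DataFrame({"temp": [10, 15, 20, 25, 30],
                     "humidity": [80, 70, 55, 45, 30]})

# Pearson correlation: close to -1 or 1 means a strong linear relationship,
# close to 0 means little or no linear relationship
r = demo["temp"].corr(demo["humidity"])
print(round(r, 2))
```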

### <br><br>Exercise 4

Write code to create a scatter plot with the `moisture_code` column on the x-axis and the `drought_code` column on the y-axis.

Based on what we learned with `hist()`, can you add a grid to the scatter plot you just made and change the size so that it is a perfect square?

## <br><br><br>Slightly more complicated examples

### <br>Removing Outliers

Let's look at the correlation between humidity and area_burned:

In [None]:
df.plot.scatter(x="humidity", y="area_burned")

<br>We can see that a few outliers are obscuring any relationship. We can remove them with a boolean mask, but first we need to look at the plot above and decide where to make the cutoff. Let's try keeping only the points where `area_burned` is below 400.

In [None]:
df_no_outliers = df[df["area_burned"] < 400]
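It can help to look at the boolean piece on its own. The comparison inside the square brackets produces one True or False per row, and only the True rows are kept. A small sketch with made-up numbers (`demo` is not the workshop data):

```python
import pandas as pd

# Made-up data with one very large area_burned value
demo = pd.DataFrame({"humidity": [51, 33, 40, 27],
                     "area_burned": [0.0, 12.5, 746.3, 88.1]})

# One True/False value per row
keep = demo["area_burned"] < 400
print(keep.tolist())  # [True, True, False, True]

# Keep only the rows where the mask is True
demo_no_outliers = demo[keep]
print(len(demo), "rows before,", len(demo_no_outliers), "rows after")
```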

In [None]:
df_no_outliers.plot.scatter(x="humidity", y="area_burned")

<br><br>There are a large number of points with 0 for area_burned. Let's remove those, and only look at days where the fires spread.

In [None]:
df_no_outliers = df[(df["area_burned"] > 0) & (df["area_burned"] < 400)]

In [None]:
df_no_outliers.plot.scatter(x="humidity", y="area_burned")
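When combining two conditions, each one needs its own parentheses, and pandas uses `&` rather than the word `and`. As an alternative sketch, the `between()` method can express the same range in one call. The `inclusive="neither"` option (available in recent pandas versions) excludes both endpoints. The data here is made up:

```python
import pandas as pd

# Made-up values, including the two endpoints 0 and 400
demo = pd.DataFrame({"area_burned": [0.0, 12.5, 746.3, 88.1, 400.0]})

# Two conditions joined with &, each wrapped in parentheses
with_and = demo[(demo["area_burned"] > 0) & (demo["area_burned"] < 400)]

# The same filter written with between()
with_between = demo[demo["area_burned"].between(0, 400, inclusive="neither")]

print(with_and["area_burned"].tolist())  # [12.5, 88.1]
```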

### <br><br><br>Plotting with categorical data

Let's say we want to see the relationship between month and humidity. We can try a scatter plot:

In [None]:
df.plot.scatter(x="month", y="humidity")

<br>That doesn't work! Let's try another type of plot - a bar plot:

In [None]:
df.plot.bar(x="month", y="humidity")

<br>Also not what we're looking for. It's plotting each data point individually.

What we really want to see is how the mean humidity changes each month. We can create a new object that only includes the data we need. We will group by month, select only the humidity column, and then find the mean. The result is a Series with one value per month.

In [None]:
hum_mean = df.groupby("month")["humidity"].mean()
hum_mean
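Here is the same chain of steps on a tiny made-up table, where the means are easy to check by hand (`demo` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up data: two rows for each of two months
demo = pd.DataFrame({"month": ["aug", "mar", "aug", "mar"],
                     "humidity": [40, 60, 50, 80]})

# Group the rows by month, pick the humidity column, then average each group
demo_mean = demo.groupby("month")["humidity"].mean()
print(demo_mean)
# aug: (40 + 50) / 2 = 45.0
# mar: (60 + 80) / 2 = 70.0
```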

<br>Now we have a nice series object that we can plot:

In [None]:
hum_mean.plot.bar()

<br>This is looking better, but the bars are in alphabetical order, and we want them in calendar order. We'll deal with that in just a minute.

### <br><br>Exercise 5

Create a bar graph that shows mean temperature grouped by month. First you'll need to create a new Series with the means. Refer back to the humidity example directly above.

In [None]:
temp_mean = 

<br><br><br>Let's deal with the sorting issue! We can create a new column in our DataFrame that contains a numerical value for the month. We will use a handy pandas method called `replace()`. Here we pass it two arguments: a list of items to replace, and a list of the replacement values in the same order:

In [None]:
df["month_num"] = df["month"].replace(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"],
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
)
df.head()
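Another common way to do this kind of lookup is `map()` with a dictionary, which keeps each month next to its number. This is an alternative sketch, not the approach used above, and it runs on a made-up column (`demo` and `month_to_num` are names introduced here for illustration):

```python
import pandas as pd

# Made-up month column, used only for illustration
demo = pd.DataFrame({"month": ["mar", "aug", "aug", "sep"]})

# Dictionary from month abbreviation to month number
month_to_num = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
                "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# map() looks up each value in the dictionary
demo["month_num"] = demo["month"].map(month_to_num)
print(demo["month_num"].tolist())  # [3, 8, 8, 9]
```

One difference to keep in mind: with `map()`, any value that is missing from the dictionary becomes NaN, while `replace()` leaves unmatched values as they are.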

<br>Now we will repeat the humidity plot we just did, but we will group by the new column.

In [None]:
hum_mean = df.groupby("month_num")["humidity"].mean()
hum_mean.plot.bar()
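If you would rather keep the month names on the x-axis, one option is to group by the original `month` column and then put the result in calendar order with `reindex()`. This is a sketch on made-up data, and `calendar_order` is a helper list introduced here for illustration:

```python
import pandas as pd

# Made-up data, used only for illustration
demo = pd.DataFrame({"month": ["sep", "mar", "aug", "aug"],
                     "humidity": [44, 51, 27, 40]})

calendar_order = ["jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec"]

# After groupby, the months come back in alphabetical order
demo_mean = demo.groupby("month")["humidity"].mean()

# Keep only the months that appear in the data, in calendar order
months_present = [m for m in calendar_order if m in demo_mean.index]
demo_mean = demo_mean.reindex(months_present)
print(demo_mean.index.tolist())  # ['mar', 'aug', 'sep']
```

Calling `demo_mean.plot.bar()` afterwards would draw the bars in that order, labeled with the month names.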

## <br><br><br>Other packages for exploring data

There are a few packages that have been created to help you visualize your data during the data exploration step without writing the code yourself. They have their limitations, however.

- pandas-profiling, since renamed ydata-profiling (gives in-depth summaries about your dataset) (I've been having trouble with the installation/dependencies lately)
- sweetviz (designed as exploration prior to machine learning, so it requires you to split your data into train and test and select a target variable)

Both packages cover:
- Data type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Histograms
- Correlations of variables
- Missing values matrix, count, etc.