# Week 7: Spreadsheets, Introduction to Pandas, Working with the NYT Best Seller list



But first, CSV files ...

Here are the first four lines of ttr-standardized.csv from our last lecture:

```
Text,Types,Tokens,TTR
1897-McKinley,290,560,51.79
1901-McKinley,303,560,54.11
1905-Roosevelt,250,560,44.64
```

In [None]:
ttr_file = open("ttr-standardized.csv", encoding="utf-8")

# skip the header row
ttr_file.readline()

for line in ttr_file.readlines():
    line = line.strip()
    columns = line.split(",")
    print(f"{columns[0]} has a TTR of {columns[-1]}")

# close the file once we are done reading it
ttr_file.close()
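
Splitting on commas by hand works for this file, but it breaks as soon as a field contains a comma inside quotation marks. As a minimal sketch (not part of the lecture code), Python's built-in `csv` module reads the same rows and lets us look up columns by name. The `sample_csv` string stands in for the file and simply repeats the four lines shown above; `sample_csv` and `rows` are illustrative names.

```python
import csv
import io

# Stand-in for ttr-standardized.csv: the four sample lines shown above.
sample_csv = """Text,Types,Tokens,TTR
1897-McKinley,290,560,51.79
1901-McKinley,303,560,54.11
1905-Roosevelt,250,560,44.64
"""

# DictReader uses the header row as keys, so each row behaves like a dictionary.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

for row in rows:
    print(f"{row['Text']} has a TTR of {row['TTR']}")
```

Note that every value comes back as a string, so `row['TTR']` would need `float(...)` before doing any arithmetic.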

# What Is (Are?) Pandas 🐼🐼

* A python library for working with tabular data
* The most-used library for data science in Python

In [None]:
import pandas as pd

In [None]:
# reading csv files is way easier in pandas!
# (this file is tab-separated, so we tell pandas with sep="\t")
nyt_df = pd.read_csv('nyt_full.tsv', sep="\t")

Pandas creates a new **data type**: a **DataFrame**.

In [None]:
type(nyt_df)
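
A DataFrame does not have to come from a file. As a small sketch, here is one built by hand from a dictionary of columns. The titles and authors are made-up placeholders, not rows from `nyt_full.tsv`, and `toy_df` is just an illustrative name.

```python
import pandas as pd

# Hypothetical toy data, not taken from nyt_full.tsv.
toy_df = pd.DataFrame({
    "title": ["BOOK A", "BOOK B", "BOOK C"],
    "author": ["Author One", "Author Two", "Author One"],
    "rank": [1, 2, 3],
})

print(type(toy_df))          # the same DataFrame type that read_csv returns
print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # the column names
```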

Let's have a look at what's inside...

In [None]:
nyt_df

# Display the First `x` Rows

If you just want to look at the first `x` number of rows, you can use the `.head()` method, as below.

In [None]:
nyt_df.head(10)

# Display a Random Sample

If you want to look at a random sample of `x` rows, you can use the `.sample()` method.

In [None]:
nyt_df.sample(10)
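
Because `.sample()` is random, you will see different rows each time you run it. If you want the same "random" rows every time (handy when sharing a notebook), you can pass the `random_state` parameter. A minimal sketch on a made-up 20-row table:

```python
import pandas as pd

toy_df = pd.DataFrame({"rank": range(1, 21)})  # hypothetical 20-row table

# With the same random_state, the sample is the same on every run.
first_draw = toy_df.sample(5, random_state=42)
second_draw = toy_df.sample(5, random_state=42)

print(first_draw.equals(second_draw))
```

The number 42 is arbitrary; any integer works as long as you reuse the same one.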

# Get Info

If you'd like to see basic information about what's in your DataFrame, you can use the `.info()` method.

In [None]:
nyt_df.info()

# Calculate Summary Statistics

In [None]:
nyt_df.describe(include="all")

## Digression: How can we describe and summarise (numeric) data statistically? (in 10 min)

- Central tendency measures (and scales/levels of measurement): mean, median, mode
- Spread measures (standard deviation, percentiles & percentile ranges) <- not today, but important
- Key abstractions of stats: probability distribution and data generation process <- if we have time


Let's look at an example. Suppose we have salaries for two different job categories: A and B. Can we find out something about the salaries? To do that we will talk about mean, median, and mode.

* Mean - arithmetic mean -- aka average
    * Add up all the numbers and divide by the number of elements
* Median - sort the data, choose the value in the center
* Mode - which value is the most frequent
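
To make those three definitions concrete, here is a sketch that computes each one "by hand" with plain Python. The numbers in `values` are made up purely for illustration.

```python
from collections import Counter

values = [3, 1, 4, 1, 5]  # hypothetical numbers, chosen only for illustration

# Mean: add everything up, then divide by how many values there are.
mean = sum(values) / len(values)

# Median: sort, then take the middle value
# (with an even number of values, average the two middle ones).
ordered = sorted(values)
middle = len(ordered) // 2
if len(ordered) % 2 == 1:
    median = ordered[middle]
else:
    median = (ordered[middle - 1] + ordered[middle]) / 2

# Mode: the value that appears most often.
mode = Counter(values).most_common(1)[0][0]

print(mean, median, mode)
```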


In [None]:
# a library that gives us some useful stats functions
import statistics

salaries_A = [75, 78, 75, 80, 75, 80, 75, 85, 75, 70]
salaries_B = [55, 60, 55, 45, 50, 55, 100, 85, 45, 55]

# First let's combine the two lists and look into all the salaries
all_salaries = salaries_A + salaries_B

print(sorted(all_salaries))

print(f"Mean salary is {statistics.mean(all_salaries)}")
print(f"Median salary is {statistics.median(all_salaries)}")
print(f"The mode is {statistics.mode(all_salaries)}")


Now let's look at each salary category separately. Which one do you think will have a higher mean?

In [None]:
print(f"Category A: {sorted(salaries_A)}")
print(f"Mean salary is {statistics.mean(salaries_A)}")
print(f"Median salary is {statistics.median(salaries_A)}")
print(f"The mode is {statistics.mode(salaries_A)}")

print("")
print(f"Category B: {sorted(salaries_B)}")
print(f"Mean salary is {statistics.mean(salaries_B)}")
print(f"Median salary is {statistics.median(salaries_B)}")
print(f"The mode is {statistics.mode(salaries_B)}")
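
The same three measures are available as methods on a Pandas Series, which is how we will usually compute them from here on. A minimal sketch using the Category B salaries from above (`salaries_B_series` is an illustrative name):

```python
import pandas as pd

salaries_B_series = pd.Series([55, 60, 55, 45, 50, 55, 100, 85, 45, 55])

print(salaries_B_series.mean())
print(salaries_B_series.median())

# .mode() returns a Series rather than a single number,
# because several values can tie for "most frequent".
print(salaries_B_series.mode().tolist())
```

For Category B the mean (60.5) sits above the median (55), because the two high salaries (85 and 100) pull the mean up while leaving the middle of the sorted list unchanged.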


# Selecting Columns

To select a single column of a DataFrame, write the name of the DataFrame, then a `[`, then the name of the column in quotation marks, then a `]`.

In [None]:
nyt_df['author']

This output is another special Pandas data type: a **`Series`.** You can think of a Pandas Series as like a spreadsheet with a single column... or as something very much like a **`list`** in Python.

In [None]:
type(nyt_df['author'])
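
Here is a quick sketch of how a Series is like a `list`, and how it differs. The author names are placeholders, and `authors_list` and `authors_series` are illustrative names.

```python
import pandas as pd

authors_list = ["Author One", "Author Two", "Author One"]  # hypothetical names
authors_series = pd.Series(authors_list, name="author")

# Like a list: it has a length and we can get items by position.
print(len(authors_series))
print(authors_series.iloc[0])

# Unlike a list: a Series carries an index (the row labels)
# and has its own methods, such as the string tools under .str
print(list(authors_series.index))
print(authors_series.str.upper().tolist())
```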

What if we want a dataframe and not a series?

Use a list of column names between `[]`

In [None]:
nyt_df[['author']]

In [None]:
type(nyt_df[['author']])

We can select **multiple** columns, and display them as a DataFrame, by again passing in a **list of strings**, each corresponding to a column name. 

In [None]:
nyt_df[['week', 'rank', 'author', 'title']]

# Counting Values

Let's do some fun stuff with Pandas!!! 

The `.value_counts()` method counts how many times each **unique item** appears in a particular column.

**What exactly is this showing us**?

In [None]:
nyt_df['title'].value_counts()

This does the same for the "author" column. **What exactly is it showing us**?

In [None]:
nyt_df['author'].value_counts()
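
If the output above is hard to read at this scale, it can help to try `.value_counts()` on something tiny. A sketch with made-up one-letter labels standing in for authors:

```python
import pandas as pd

toy_authors = pd.Series(["A", "B", "A", "C", "A", "B"])  # hypothetical labels

counts = toy_authors.value_counts()
print(counts)

# The index holds the unique values; the values hold how often each appears.
print(counts["A"])
print(counts.index[0])  # the most frequent value comes first
```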

The outputs above are those `Series` objects again. I mentioned above that `Series` are a lot like `list`s, and indeed we can slice them just like `list`s if we want to see a particular number of values...

In [None]:
nyt_df['author'].value_counts()[:20]
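
Since `.value_counts()` sorts from most to least frequent, slicing off the front gives us the "top" values. As a sketch on made-up labels, the `.head()` method we met earlier gives the same result as the slice:

```python
import pandas as pd

toy_authors = pd.Series(["A", "B", "A", "C", "A", "B", "D"])  # hypothetical labels
counts = toy_authors.value_counts()

# Two ways to keep the two most frequent values.
top_two_slice = counts[:2]
top_two_head = counts.head(2)

print(top_two_slice.equals(top_two_head))
```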

Let's take a quick step back and review how we can chain these together. 

In [None]:
# In one line:
top_20_authors = nyt_df['author'].value_counts()[:20]

# select the author column as a Series


# create a series that counts the number of times each author appears


# slice the series to get the top 20.

top_20_authors

In [None]:
# Which titles have been on the NYT best sellers list the most?
nyt_df['title'].value_counts()[:20]

The output of the line of code below contains the average number of times that a given NYT Best Seller appears in the list. Can you find what that number is? Can you explain how this line of code works? Do you understand why we've stacked `.value_counts()` and `.describe()`?

In [None]:
nyt_df['title'].value_counts().describe()
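
To check your answer, here is the same chain on a tiny made-up Series, where the arithmetic is easy to do in your head. The one-letter "titles" are placeholders.

```python
import pandas as pd

toy_titles = pd.Series(["X", "X", "X", "Y", "Y", "Z"])  # hypothetical titles

# Step 1: one row per distinct title, with its number of appearances.
appearances = toy_titles.value_counts()   # X: 3, Y: 2, Z: 1

# Step 2: summary statistics of those appearance counts.
summary = appearances.describe()

print(summary["count"])  # how many distinct titles there are
print(summary["mean"])   # average number of appearances per title
```

Here there are 3 distinct titles and 6 appearances in total, so the mean is 6 / 3 = 2 appearances per title.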

# Make and Save Plots

Pandas is also very handy for making **plots**, aka visualizations of data. All you need to do is add the `.plot()` method and some parameters. Here's the simplest `.plot()` command I can think of, which specifies that we want a **bar plot**.

The types of plots in Pandas include:

- `bar` or `barh` for bar plots
- `hist` for histogram
- `box` for boxplot
- `kde` or `density` for density plots
- `area` for area plots
- `scatter` for scatter plots
- `hexbin` for hexagonal bin plots
- `pie` for pie plots

In [None]:
all_titles = nyt_df['title'].value_counts()
titles_plot = all_titles.plot(kind="bar")

titles_plot.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    top=False,         # ticks along the top edge are off
    labelbottom=False  # labels along the bottom edge are off
)
# vertical lines 25%, 50%, and 75% of the way through the titles;
# horizontal line at the mean number of appearances
n_titles = len(all_titles)
titles_plot.axvline(x=n_titles*.25, color='red', linestyle='--', linewidth=1, label='25%')
titles_plot.axvline(x=n_titles*.5, color='red', linestyle='--', linewidth=1, label='50%')
titles_plot.axvline(x=n_titles*.75, color='red', linestyle='--', linewidth=1, label='75%')
titles_plot.axhline(y=all_titles.mean(), color='green', linestyle='--', linewidth=1, label='mean')




In [None]:
top_ten_authors = nyt_df['author'].value_counts()[:10]
print(top_ten_authors.plot(kind="bar"))

### Now, we're going to immediately get into the habit of **tucking our plots into variables**. 

### This is how we're asking you to make a plot in your homework, so take note!!


Okay, let's add a title to that plot, using the `title` parameter (and a `\n` "newline" character!)

In [None]:
top_ten_authors = nyt_df['author'].value_counts()[:10]
plot = top_ten_authors.plot(kind="bar", title='NYT Best Sellers:\nTen Authors Who Appear Most Frequently')

print(plot)

And let's try making two different kinds of plots.

First, a `barh` or **horizontal bar**...

In [None]:
top_ten_authors = nyt_df['author'].value_counts()[:10]
plot = top_ten_authors.plot(kind="barh", title='NYT Best Sellers:\nTen Authors Who Appear Most Frequently')
print(plot)

Here is a `pie` plot, which could potentially be misconstrued in this context! **Why is this a potentially misleading plot?**

In [None]:
top_ten_authors = nyt_df['author'].value_counts()[:10]
plot = top_ten_authors.plot(kind="pie", title='NYT Best Sellers:\nTen Authors Who Appear Most Frequently')
print(plot)

Now, if we wanted to **save** our pretty plot as a file, we could do so by applying the `.figure.savefig()` method to the variable containing our plot, and providing a filename as the argument.

In [None]:
plot = nyt_df['author'].value_counts()[:10].plot(kind='bar', title='NYT Best Sellers:\nTen Authors Who Appear Most Frequently')
plot.figure.savefig('NYT-top10authors-barchart.png')
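
Bar plots with long labels sometimes get their labels cut off in the saved file. As a hedged sketch, `savefig` accepts `bbox_inches="tight"` to trim and fit the figure, and `dpi` to control resolution. The counts below are made up, the filename is a placeholder, and saving to the temporary directory is only there to keep the example from cluttering your folder.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw without opening a window (an assumption for running as a script)
import pandas as pd

# Hypothetical counts, standing in for nyt_df['author'].value_counts()[:10]
counts = pd.Series({"Author One": 5, "Author Two": 3, "Author Three": 2})
plot = counts.plot(kind="bar", title="Toy Example")

# bbox_inches="tight" keeps the rotated labels inside the saved image;
# dpi sets the resolution of the saved image.
out_path = os.path.join(tempfile.gettempdir(), "toy-barchart.png")
plot.figure.savefig(out_path, dpi=150, bbox_inches="tight")

print(os.path.exists(out_path))
```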

# Filtering Data

Let's say we wanted to produce a DataFrame object that ONLY included rows in which Toni Morrison is the author. We could do so with the following line of code:

In [None]:
nyt_df[nyt_df['author'] == 'Toni Morrison']

Let's dig into this a bit more.  We want a boolean expression to select elements of a column with a particular value. Here is such an expression.  But note that it creates a boolean Series.

In [None]:
nyt_df['author'] == 'Toni Morrison'

In [None]:
# Let's put the series into its own variable 

morrison_filter = nyt_df['author'] == 'Toni Morrison'
type(morrison_filter)

Pandas has built-in functionality whereby, if you "subset" or "slice" a DataFrame with a **boolean Series** of the same length as that DataFrame, it will produce a new DataFrame that only contains the rows marked `True` in the boolean Series.

The below line of code is absolutely equivalent to `nyt_df[nyt_df['author'] == 'Toni Morrison']` encountered earlier, just a bit easier to read and understand!

In [None]:
nyt_df[morrison_filter]
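
Here is the same two-step idea on a three-row made-up table, small enough to see every `True` and `False`. The names `toy_df`, `mask`, and `filtered` are illustrative.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "author": ["Author One", "Author Two", "Author One"],  # hypothetical rows
    "title": ["BOOK A", "BOOK B", "BOOK C"],
})

# Step 1: the comparison gives one True/False per row.
mask = toy_df["author"] == "Author One"
print(mask.tolist())
print(mask.sum())  # True counts as 1, so this is the number of matching rows

# Step 2: the boolean Series keeps only the rows marked True.
filtered = toy_df[mask]
print(filtered["title"].tolist())
```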

Let's capture that output (that 115 row x 6 column DataFrame) in a variable. The two lines of code below do exactly the same thing.

In [None]:
morrison_nyt = nyt_df[nyt_df['author'] == 'Toni Morrison']
morrison_nyt = nyt_df[morrison_filter]

Let's have a look at that variable...

In [None]:
morrison_nyt

Can you think of the line of code you would need to use to display all the rows in the dataset for novels titled "BELOVED"? (As you've noticed, book titles in this dataset are in ALL CAPS)

In [None]:
nyt_df[nyt_df['title'] == 'BELOVED']

How about a line of code that could create a new DataFrame that only contains titles that achieved the rank of #1 on the best seller list?

In [None]:
nyt_df[nyt_df['rank'] == 1]
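
Because each filter is just a boolean Series, two filters can be combined. As a sketch going slightly beyond what we covered above, `&` means "and" and `|` means "or" (Python's plain `and` / `or` do not work on Series). The rows below are made up.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "author": ["Author One", "Author Two", "Author One"],  # hypothetical rows
    "rank": [1, 1, 4],
})

is_author_one = toy_df["author"] == "Author One"
is_number_one = toy_df["rank"] == 1

# Rows where both conditions hold, and rows where at least one holds.
both = toy_df[is_author_one & is_number_one]
either = toy_df[is_author_one | is_number_one]

print(len(both), len(either))
```

If you write the comparisons inline instead of storing them in variables, wrap each one in parentheses, for example `toy_df[(toy_df["rank"] == 1) & (toy_df["author"] == "Author One")]`.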