# Exercise: Cherry Blossoms!

#### Summary

Once upon a time it was サクラ season, which meant the [cherry blossoms](https://en.wikipedia.org/wiki/Cherry_blossom) were in full bloom! This year they bloomed a little early and they've long since faded, so for today we'll stick with data-driven blossoms: http://atmenv.envi.osakafu-u.ac.jp/aono/kyophenotemp4/

#### Data Source(s)

Historical Series of Phenological data for Cherry Tree Flowering at Kyoto City
(and March Mean Temperature Reconstructions), http://atmenv.envi.osakafu-u.ac.jp/aono/kyophenotemp4/

#### Files

- KyotoFullFlower7.xls, "Full-flowering Dates of Prunus jamasakura in Kyoto City"

#### Skills

- Working with Excel files
- Ignoring the first few rows
- Replacing NaN values
- Counting and summarizing columns
- Replacing non-NaN values
- Extracting with strings
- Rolling means

# Read in `KyotoFullFlower7.xls`

Be sure to look at the first five rows.

In [None]:
import xlrd
import pandas as pd
df = pd.read_excel('KyotoFullFlower7.xls',skiprows = 25)
df.head(5)

### That... doesn't look right. Why not? 

Examine your column names, and maybe even open up the file in Excel.

### Read in the file correctly, and look at the first five rows

- TIP: The first year should be 801 AD, and it should not have any dates or anything.
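
The idea behind `skiprows` can be tried without the spreadsheet. This is a minimal sketch on a made-up in-memory CSV (the note lines and values are invented for illustration, not taken from the Kyoto file); `pd.read_excel` accepts the same `skiprows` argument.

```python
import io
import pandas as pd

# Toy stand-in for a file that starts with notes before the real header row.
# The contents are made up for illustration; they are not the Kyoto data.
raw = io.StringIO(
    "Notes about the dataset\n"
    "More notes\n"
    "AD,Full-flowering date (DOY)\n"
    "801,\n"
    "802,\n"
    "803,92\n"
)

# Skip the two note lines so the third line is used as the header.
toy = pd.read_csv(raw, skiprows=2)
print(toy.columns.tolist())
print(toy.head())
```

If the column names look like notes or data instead of labels, the number passed to `skiprows` is probably off by a row or more.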

### Look at the final five rows of the data

In [None]:
df.tail(5)

## Watching out for NaN values

Take a look at **Reference Name**. Is there something you should set to be `NaN`? Use either of the two ways we have covered.

In [None]:
import numpy as np
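# Assign the result back instead of using inplace=True on a single column,
# which newer pandas versions warn about.
df['Reference Name'] = df['Reference Name'].replace('-', np.nan)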
df.head()
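
As a sketch of two common ways to turn a placeholder into `NaN` (assuming these are the two approaches meant here: `na_values` while reading, and `.replace` afterwards), here is a toy CSV with made-up reference names:

```python
import io
import numpy as np
import pandas as pd

# Made-up rows for illustration; "-" stands for a missing reference.
csv_text = "AD,Reference Name\n801,-\n802,EXAMPLE-DIARY\n803,-\n"

# Way 1: declare the placeholder as missing while reading.
at_read = pd.read_csv(io.StringIO(csv_text), na_values=["-"])

# Way 2: read first, then swap the placeholder for NaN.
after_read = pd.read_csv(io.StringIO(csv_text))
after_read["Reference Name"] = after_read["Reference Name"].replace("-", np.nan)

# Both approaches leave one real value and two missing ones.
print(at_read["Reference Name"].isna().sum())
print(after_read["Reference Name"].isna().sum())
```

Only cells that are exactly `-` are affected, so a name that merely contains a hyphen is left alone.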

### Check that you have 827 values for "Full-flowering date (DOY)" and 825 for "Reference Name"

In [None]:
df['Full-flowering date (DOY)'].count()

In [None]:
df['Reference Name'].count()

# Cleaning up our data

## What sources are the most common as a reference?

In [None]:
df['Reference Name'].value_counts()

## Filter the list to only include rows that have a `Full-flowering date (DOY)`

In [None]:
df[df['Full-flowering date (DOY)'].notnull()]['Reference Name'].value_counts()
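
The filtering step above can be checked on a small made-up frame. This sketch (toy values, hypothetical reference names) compares a boolean mask with `dropna(subset=...)`, which should keep the same rows:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Full-flowering date (DOY)": [np.nan, 92.0, 105.0, np.nan],
    "Reference Name": ["A", "B", "B", "C"],
})

# Boolean mask: True where a flowering date was recorded.
has_date = toy["Full-flowering date (DOY)"].notnull()
by_mask = toy[has_date]

# dropna with subset= does the same filtering in one call.
by_dropna = toy.dropna(subset=["Full-flowering date (DOY)"])

print(by_mask["Reference Name"].value_counts())
```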


## Make a histogram of the full-flowering date.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

df['Full-flowering date (DOY)'].hist()


## Make another histogram of the full-flowering date, but with 39 bins instead of 10

In [None]:
df['Full-flowering date (DOY)'].hist(bins=39)

## What's the average number of days it takes for the flowers to blossom? The max? Min? And how many records do we have?

Answer these with one line of code.

In [None]:
df['Full-flowering date (DOY)'].describe()

## What's the average number of days into the year cherry flowers normally blossomed before 1900?

In [None]:
df[df['AD']<1900]['Full-flowering date (DOY)'].mean()

## How about after 1900?

In [None]:
df[df['AD']>1900]['Full-flowering date (DOY)'].mean()

## How many times was our data from a title in Japanese poetry?

You'll need to read the documentation inside of the Excel file.

In [None]:
len(df[df['Data type code'] == 4])

## Actually, that looks terrible. Replace the "Source code" and "Data type code" columns with the values they stand for.

In [None]:
Source_code = { 1: "Reported by Taguchi (1939), J. Marine Meteorol. Soc. (Umi to Sora), 19, 217-227",
 2: "Added by Sekiguchi (1969), Tokyo Geography Papers, 13, 175-190.",
 3: "Added by Aono and Omoto (1994), J. Agric. Meteorol., 49, 263-272.",
 4: "Added by Aono and Kazui (2008), Int. J. Climatol., 28, 905-914 (doi: 10.1002/joc.1594).",
 5: "Cherry phenological data, Added by Aono and Saito (2010), Int. J. Biometeorol., 54, 211-219.",
 6: "Added by Aono (2011), Time Studies, 4, 17-29. (in Japanese with English abstract)",
 7: "Added by Aono (2012), Chikyu Kankyo, 17, 21-29. (in Japanese)",
 8: "Found after the last publication of articles."}

data_types = {
    0: "modern times (full-bloom date since 1880s)",
    1: "diary description about full-bloom",
    2: "diary description about cherry blossom viewing party",
    3: "diary description about presents of cherry twigs from party participants",
    4: "title in Japanese poetry",
    8: "Deduced from wisteria phenology, using the relation proposed by Aono and Saito (2010)",
    9: "Deduced from Japanese kerria phenology, using the relation proposed by Aono (2011)"
}

# Assign the results back rather than relying on inplace=True on a single column.
df['Source code'] = df['Source code'].replace(Source_code)
df['Data type code'] = df['Data type code'].replace(data_types)
df.tail()

## Show only the years where our data was from a title in Japanese poetry

In [None]:
df[df['Data type code'] == 'title in Japanese poetry']

In [None]:
df['Data type code'].value_counts()
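
Swapping codes for labels with a dictionary, as the cells above do with `.replace`, can be tried on a toy Series. This sketch uses made-up short labels (hypothetical, not the descriptions from the Excel file) and also shows `.map`, which treats codes missing from the dictionary differently.

```python
import pandas as pd

codes = pd.Series([1, 2, 4, 7])
labels = {1: "diary", 2: "party", 4: "poetry"}  # hypothetical short labels

# replace: codes without a dictionary entry are left as they were.
replaced = codes.replace(labels)

# map: codes without a dictionary entry become NaN.
mapped = codes.map(labels)

print(replaced.tolist())
print(mapped.tolist())
```

`.replace` is the safer choice when the dictionary might not cover every code, since nothing is silently turned into a missing value.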

## Graph the full-flowering date (DOY) over time

In [None]:
df.plot(x = 'AD', y = 'Full-flowering date (DOY)')

## Smooth out the graph

It's so jagged! You can use `df.rolling` to calculate a rolling average.

The following code calculates a **10-year mean**, using the `AD` column as the anchor. If there aren't 10 samples to work with in a row, it'll accept down to 5. Neat, right?

(We're only looking at the final 5)

In [None]:
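# Select the numeric columns first: newer pandas versions raise an error
# when asked to average text columns such as "Reference Name".
df[['AD', 'Full-flowering date (DOY)']].rolling(10, on='AD', min_periods=5).mean().tail()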
# If this gives you an error you're using an old pandas version,
# so you can use df.set_index('AD').rolling(10, min_periods=5).mean().reset_index().tail()
# instead
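
To see what the window size and `min_periods` do, here is a minimal sketch on a made-up five-value Series with one gap. It uses a window of 3 and `min_periods=2` purely to keep the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

# Toy values with one missing entry in the middle.
s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])

# Each result averages the current value and the two before it,
# but only if at least 2 of those 3 slots hold real numbers.
smoothed = s.rolling(3, min_periods=2).mean()
print(smoothed)
# 0    NaN   (only one value available)
# 1    1.5   (mean of 1 and 2)
# 2    1.5   (mean of 1 and 2, the NaN is ignored)
# 3    3.0   (mean of 2 and 4)
# 4    4.5   (mean of 4 and 5)
```

A larger window gives a smoother line, and a smaller `min_periods` lets the average survive stretches with many missing years.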

Adjust the code above to compute **and graph** a 20-year rolling average for the entire dataset.

In [None]:
df[['AD', 'Full-flowering date (DOY)']].rolling(20, on='AD', min_periods=5).mean().plot(x='AD', y='Full-flowering date (DOY)')

# Adding a month column

### HOLD ON, time to learn something

**There are a few ways to do the next question**, but a couple popular methods will have pandas yell at you. You might want to try this new thing called `loc`! **It is used to update a column in a row based on a condition.**

```
df.loc[df.country == 'Angola', "continent"] = "Africa"
```

This updates the `continent` column to be `Africa` for every row where `df.country == 'Angola'`. You CANNOT do the following, which is probably what you've wanted to do:

```
df[df.country == 'Angola']['continent'] = 'Africa'
```

And now you know.
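
Here is a small runnable version of the `loc` pattern, using a made-up three-row frame with the same `country` and `continent` column names as the snippet above:

```python
import pandas as pd

# Toy data for illustration; "?" marks a continent we have not filled in yet.
toy = pd.DataFrame({
    "country": ["Angola", "Japan", "Angola"],
    "continent": ["?", "Asia", "?"],
})

# Row selector first (the condition), column name second, then the new value.
toy.loc[toy.country == "Angola", "continent"] = "Africa"
print(toy)
```

Only the rows matching the condition change; the `Japan` row keeps its original value.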

### Actually adding our column

Right now the "Full-flowering date" column is pretty rough. It uses numbers like '402' to mean "April 2nd" and "416" to mean "April 16th." Let's make a column to explain what month it happened in.

* Every row that happened in April should have 'April' in the `month` column.
* Every row that happened in March should have 'March' as the `month` column.
* Every row that happened in May should have 'May' as the `month` column.

In [None]:
# Raw string keeps the backslashes intact; \. matches the literal decimal point in e.g. "402.0"
df['month'] = df['Full-flowering date'].astype(str).str.extract(r'(\d\d?)(\d{2})\.\d')[0]
df.loc[df.month == '3', 'month'] = "March"
df.loc[df.month == '4', 'month'] = "April"
df.loc[df.month == '5', 'month'] = "May"

### Using your new column, how many blossomings happened in each month?

In [None]:
df['month'].value_counts()

### Graph how many blossomings happened in each month.

In [None]:
df['month'].value_counts().plot(kind='bar')

## Adding a day-of-month column

Now we're going to add a new column called **day-of-month** based on the full-flowering date.

- 402 means "April 2"
- 312 means "March 12"
- 511 means "May 11"

**We're only interested in the second part**. Previously I've had students convert them to integers to do this, but you know regular expressions!

- Tip: You won't be able to extract anything from a float, you'll need it to be a string
- Tip: Two different things mean "treat this column as a string": `.astype(str)` converts the values to strings, while `.str` gives you access to string methods such as `.str.extract`.

In [None]:
# Same pattern as for the month column; group 1 holds the two day digits
df['day-of-month'] = df['Full-flowering date'].astype(str).str.extract(r'(\d\d?)(\d{2})\.\d')[1]
df.tail()
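
The extraction can be checked on the three example values from the list above. This sketch also shows the integer approach mentioned earlier for comparison; it assumes there are no missing values, which is true for this toy Series but not for the real column.

```python
import pandas as pd

# The three example dates, stored as floats like in the spreadsheet column.
dates = pd.Series([402.0, 312.0, 511.0])

# As text: 402.0 becomes "402.0", then capture the month digit(s)
# and the two day digits that sit just before the decimal point.
parts = dates.astype(str).str.extract(r"(\d\d?)(\d{2})\.\d")
month_str = parts[0]  # "4", "3", "5"
day_str = parts[1]    # "02", "12", "11"

# As numbers: integer division and remainder by 100.
# astype(int) would fail on NaN, so this only works on complete data.
month_num = dates.astype(int) // 100
day_num = dates.astype(int) % 100

print(parts)
```

The regex version keeps the leading zero ("02"), which is handy when building a display string later.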

Now that you've successfully extracted the last two digits, save them into a new column called `'day-of-month'`

### Adding a date column

Now take the `'month'` and `'day-of-month'` columns and combine them in order to create a new column called `'date'`. It should look like "April 09".
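
One possible way to combine the two columns, sketched on a made-up frame: it assumes both columns already hold strings, as they do after the regex extraction.

```python
import pandas as pd

# Toy rows for illustration; both columns are strings.
toy = pd.DataFrame({
    "month": ["April", "March", "May"],
    "day-of-month": ["09", "12", "11"],
})

# String columns can be joined with +, with a space in between.
toy["date"] = toy["month"] + " " + toy["day-of-month"]
print(toy)
```

If either value in a row is missing, the combined `date` for that row comes out missing too.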