# Exercise notebook

In [None]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

import pandas as pd
from datetime import datetime

# Weather Analysis

You have learned some more about Python and the pandas module and tried it out on a
fairly small dataset. You are now ready to explore a dataset from the Weather
Underground.

## Weather Data

We will be investigating historic weather data.
Of course, such data is hugely important for research into the large-scale, long-term shift
in our planet’s weather patterns and average temperatures – climate change. However,
such data is also incredibly useful for more mundane planning purposes. To demonstrate
the learning this week, we will be using historic weather data to try and plan a
summer holiday. You’ll use the data too and get a chance to work on your own
project at the end of the week.
The dataset we’ll use to do this will come from the [Weather Underground](http://www.wunderground.com/), which creates
weather forecasts from data sent to them by a worldwide network of over 100,000 weather
enthusiasts who have personal weather stations on their house or in their garden.
In addition to creating weather forecasts from that data, the Weather Underground also
keeps that data as historic weather records allowing members of the public to download
weather datasets for a particular time period and location. These datasets are
downloaded as CSV files, explained in the next step.
Datasets are rarely ‘clean’ and fit for purpose, so it will be necessary to clean up the data
and ‘mould it’ for your purposes. You will then learn how to visualise data by creating
graphs using the `plot()` function.

We have downloaded the file `London_2014.csv` from our website; it can now be read into a dataframe.

In [None]:
london = pd.read_csv('London_2014.csv')
london.head()

Note that the right-hand side of the table has been cropped to fit on the page.
You’ll find out how to remove rogue spaces.

### Removing initial spaces
One of the problems often encountered with CSV files is rogue spaces before or after data
values or column names.

You learned earlier, in What is a CSV file?, that each value or column name is separated
by a comma. However, if you opened ‘London_2014.csv’ in a text editor, you would see
that in the row of column names sometimes there are spaces after a comma:
    
`GMT,Max TemperatureC,Mean TemperatureC,Min TemperatureC,Dew PointC,
MeanDew PointC,Min DewpointC,Max Humidity, Mean Humidity, Min Humidity,
Max Sea Level PressurehPa, Mean Sea Level PressurehPa, Min Sea Level
PressurehPa, Max VisibilityKm, Mean VisibilityKm, Min VisibilitykM, Max Wind
SpeedKm/h, Mean Wind SpeedKm/h, Max Gust SpeedKm/h,Precipitationmm,
CloudCover, Events,WindDirDegrees`

For example, there is a space after the comma between Max Humidity and Mean
Humidity. This means that when read_csv() reads the row of column names it will
interpret a space after a comma as part of the next column name. So, for example, the
column name after `'Max Humidity'` will be interpreted as `' Mean Humidity'` rather
than what was intended, which is `'Mean Humidity'`. The ramification of this is that code
such as:
    
`london[['Mean Humidity']]`

will cause a key error (see Selecting a column), as the column name is confusingly
`' Mean Humidity'`.

This can easily be rectified by adding another argument to the `read_csv()` function,
`skipinitialspace=True`,
which will tell `read_csv()` to ignore any spaces after a comma.
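
The effect of `skipinitialspace=True` can be sketched on a tiny in-memory CSV. The two-column text below is made up for illustration (it is not the London file), and `csv_text`, `raw` and `clean` are hypothetical names:

```python
import io
import pandas as pd

# Made-up CSV text with a space after the comma, as in the London file's header row.
csv_text = "Max Humidity, Mean Humidity\n94, 86\n"

# Default behaviour: the space becomes part of the second column name.
raw = pd.read_csv(io.StringIO(csv_text))

# With skipinitialspace=True the space after the comma is ignored.
clean = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(list(raw.columns))
print(list(clean.columns))
```

Comparing the two printed lists shows that only the second call yields a column that can be selected as `'Mean Humidity'`.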

There are too many columns for the dataframe to fit horizontally in this notebook, but they can be displayed separately.

In [None]:
london.columns

This shows that <code>' Max Wind SpeedKm/h'</code> is prefixed by a space, as are other column names such as <code>' Mean Humidity'</code> and <code>' Max Sea Level PressurehPa'</code>.

The <code>read_csv()</code> function has interpreted spaces after commas as being part of the next value. This can be rectified easily by adding another argument to the <code>read_csv()</code> function to skip the initial spaces after a comma.

In [None]:
london = pd.read_csv('London_2014.csv', skipinitialspace=True)

### Removing extra characters

Another problem shown above is that the final column is called <code>'WindDirDegrees&lt;br /&gt;'</code>.

When the dataset was exported from the Weather Underground website, HTML line breaks were automatically added to each line in the file, which <code>read_csv()</code> has interpreted as part of the column name and its values. This can be seen more clearly by looking at some values in the final column:

In [None]:
london['WindDirDegrees<br />'].head()

It seems there is an HTML line break at the end of each line. If I opened `London_2014.csv`
in a text editor and looked at the ends of all lines in the file, this would be confirmed.
Once again I’m not going to edit the CSV file but rather fix the problem in the dataframe.

To change `'WindDirDegrees<br />'` to `'WindDirDegrees'` all I have to do is use the `rename()` method as follows:

In [None]:
london = london.rename(columns={'WindDirDegrees<br />' : 'WindDirDegrees'})

Don’t worry about the syntax of the argument for `rename()`; just use this example as a
template for whenever you need to change the name of a column.

Now I need to get rid of those pesky HTML line breaks from the ends of the values in the `'WindDirDegrees'` column, so that
they become something sensible. I can do that using the string method `rstrip()`, which
is used to remove characters from the end or ‘rear’ of a string, just like this:

In [None]:
london['WindDirDegrees'] = london['WindDirDegrees'].str.rstrip('<br />')

Again don’t worry too much about the syntax of the code and simply use it as a template
for whenever you need to process a whole column of values stripping characters from the
end of each string value.
Let’s display the first few rows of the `'WindDirDegrees'` column to confirm the changes:

In [None]:
london['WindDirDegrees'].head()
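
One detail worth knowing about `rstrip()`: its argument is treated as a set of characters to remove from the end, not as an exact suffix. The sketch below uses made-up values (`values` and `stripped` are hypothetical names) to show that this is harmless for numeric strings like the wind directions, but can remove more than intended for other text:

```python
import pandas as pd

# Made-up values: the first two mimic the wind direction column,
# the third ends in letters that also occur in '<br />'.
values = pd.Series(['230<br />', '9<br />', 'rb<br />'])

# Every trailing character that appears in '<br />' is removed,
# stopping at the first character that does not appear in it.
stripped = values.str.rstrip('<br />')

print(stripped.tolist())
```

The digits survive because none of them occur in `'<br />'`, whereas the third value is stripped down to an empty string. Recent versions of pandas also offer `str.removesuffix()`, which removes an exact suffix only.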

### Missing values

Missing (also called null or not available) values are marked as NaN (not a number) in dataframes; they are one of the reasons to clean data.

The `isnull()` method returns `True` for each row in a column that has a null value. The method can be used to select and display those rows. Scroll the table below to the right to check that the events column is only showing missing values.

Finding missing values in a particular column can be done with the column method
`isnull()`, like this:

In [None]:
london[london['Events'].isnull()]

On its own, the expression `london['Events'].isnull()` returns a series of Boolean values, where `True` indicates that the
corresponding row in the `'Events'` column is missing a value and `False` indicates the
presence of a value.

One way to deal with missing values is to replace them with some value. The column method `fillna()` fills all not available value cells with the value given as argument. Later in this section, each missing event is replaced by the empty string.

If, as you did with the comparison expressions, you put the `isnull()` expression within square brackets
after the dataframe’s name, it will return a new dataframe consisting of all the rows without
recorded events (rain, fog, thunderstorm, etc.):

In [None]:
london[london['Events'].isnull()]

This will return a new dataframe with 114 rows, showing that more than one in three days had no particular event recorded.
If you scroll the table to the right, you will see that all values in the `'Events'` column are
marked `NaN`, which stands for ‘Not a Number’, but is also used to mark non-numeric
missing values, like in this case (events are strings, not numbers).

Once you know how much and where data is missing, you have to decide what to do:
    
- Ignore those rows?
- Replace with a fixed value? 
- Replace with a computed value, like the mean?

In this case, only the first two options are possible. The method call `london.dropna()`
will drop (remove) all rows that have a missing (non-available) value somewhere,
returning a new dataframe. This will therefore also remove rows that have missing values
in other columns.
The column method `fillna()` will replace all non-available values with the value given
as argument. For this case, each NaN could be replaced by the empty string.

In [None]:
london['Events'] = london['Events'].fillna('')
london[london['Events'].isnull()]

The second line above will now show an empty dataframe, because there are no longer
missing values in the events column.
As a final note on missing values, pandas ignores them when computing numeric
statistics, i.e. you don’t have to remove missing values before applying `sum()`,
`median()` and other similar methods.
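
As a minimal sketch of that behaviour, the made-up series below (not taken from the London data; `gusts` is a hypothetical name) contains one missing value, which the numeric methods simply skip:

```python
import numpy as np
import pandas as pd

# Made-up gust speeds with one missing value.
gusts = pd.Series([40.0, np.nan, 60.0])

total = gusts.sum()      # adds only the two available values
middle = gusts.median()  # median of the two available values
available = gusts.count()  # number of non-missing values

print(total, middle, available)
```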

The empty dataframe (no rows) confirms there are no more missing event values.

Another way to deal with missing values is to ignore rows with them. The `dropna()` dataframe method returns a new dataframe where all rows with at least one non-available value have been removed.

In [None]:
london.dropna()

Note that the table above has fewer than 251 rows (the original 365 minus the 114 days with missing events), so there must be further null values besides the 114 missing events.
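
One way to find out which columns hold those further null values is to count the nulls per column by chaining `isnull()` with `sum()`. The sketch below uses a small made-up dataframe (`toy` is a hypothetical name) rather than the London data:

```python
import numpy as np
import pandas as pd

# Made-up data: gust speed is missing on two days,
# and missing events have already been replaced by the empty string.
toy = pd.DataFrame({
    'Max Gust SpeedKm/h': [40.0, np.nan, np.nan, 55.0],
    'Events': ['Rain', '', '', 'Fog'],
})

# isnull() gives a table of Booleans; sum() counts the True values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# dropna() removes every row with at least one missing value.
remaining = toy.dropna()
print(len(remaining))
```

Note that the empty string is not counted as a missing value, so here only the gust speed column causes rows to be dropped.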

## Changing the value type of a column

The function `read_csv()` may, for many reasons, wrongly interpret the data type of the
values in a column, so when cleaning data it’s important to check the data types of each
column are what is expected, and if necessary change them.

The type of every column in a dataframe can be determined by looking at the dataframe's `dtypes` attribute, like this:

In [None]:
london.dtypes

In the above output, you can see the column names to the left and to the right the data
types of the values in those columns.

- **int64** is the pandas data type for whole numbers such as `55` or `2356`
- **float64** is the pandas data type for decimal numbers such as `55.25` or `2356.00`
- **object** is the pandas data type for strings such as `'hello world'` or `'rain'`

Most of the column data types seem fine; however, two are of concern, `'GMT'` and
`'WindDirDegrees'`, both of which are of type `object`. Let’s take a look at
`'WindDirDegrees'` first.

**Changing the data type of the `'WindDirDegrees'` column**

The `read_csv()` function has interpreted the values in the `'WindDirDegrees'` column
as strings (type `object`). This is because in the CSV file the values in that column had all
been suffixed with that HTML line break string,
so `read_csv()` had no alternative but to interpret the values as strings.
The values in the `'WindDirDegrees'` column are meant to represent wind direction in
terms of degrees from true north (360), and meteorologists always define the wind
direction as the direction the wind is coming from. So if you stand so that the wind is
blowing directly into your face, the direction you are facing names the wind, so a westerly
wind is reported as 270 degrees. The compass rose shown below should make this
clearer:

We need to be able to make queries such as ‘Get and display the rows where the wind
direction is greater than 350 degrees’. To do this we need to change the data type of the
`'WindDirDegrees'` column from `object` to `int64`.
The type of all the values in a column can be changed using the <code>astype()</code> method. The following code will change the values in the <code>'WindDirDegrees'</code> column from strings (`object`) to integers (<code>int64</code>).

In [None]:
london['WindDirDegrees'] = london['WindDirDegrees'].astype('int64')   

Now all the values in the `'WindDirDegrees'` column are of type `int64` and we can
make our query:

In [None]:
london[london['WindDirDegrees'] > 350]
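
To see why the conversion matters, the sketch below compares made-up direction values first as strings and then as integers. The names `directions`, `as_text` and `as_numbers` are hypothetical, and `dtype=object` is set explicitly to mimic how the column was read from the file:

```python
import pandas as pd

# Made-up wind directions stored as strings.
directions = pd.Series(['9', '355', '20'], dtype=object)

# Strings are compared character by character, so '9' counts as greater than '350'.
as_text = directions > '350'

# After conversion, the values are compared as numbers.
as_numbers = directions.astype('int64') > 350

print(as_text.tolist())
print(as_numbers.tolist())
```

The string comparison wrongly selects the 9 degree value, while the numeric comparison selects only the 355 degree value.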

**Changing the data type of the `'GMT'` column**

Recall that I noted that the `'GMT'` column was of type `object`, the type pandas uses for
strings.

The `'GMT'` column is supposed to represent dates. It would be helpful for the date values
not to be strings, to make it possible to make queries of the data such as ‘Return the row
where the date is 4 June 2014’.

Pandas has a function called `to_datetime()` which can convert a column of `object`
(string) values such as those in the `'GMT'` column into values of a proper date type called
`datetime64`, just like this:

In [None]:
london['GMT'] = pd.to_datetime(london['GMT'])
london.dtypes

From the above output, we can confirm that the `'WindDirDegrees'` column type has
been changed from `object` to `int64` and that the `'GMT'` column type has been changed
from `object` to `datetime64`.

To make queries such as ‘Return the row where the date is 4 June 2014’ you’ll need to be
able to create a date value to represent 4 June 2014. It should not be written as:
`london[london['GMT'] == '2014-6-4']`
because `'2014-6-4'` is a string and the values in the `'GMT'` column are of type
`datetime64`. Instead, create a date value using the `datetime()`
function like this:
    
`datetime(2014, 6, 4)`

In the function call above, the first integer argument is the year, the second the month and
the third the day.
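
As a minimal sketch of how such a value is used, the made-up three-row dataframe below (`toy`, `target` and `match` are hypothetical names, and the temperatures are invented) converts a column of date strings and then selects one day:

```python
from datetime import datetime
import pandas as pd

# Made-up data standing in for the London dataframe.
toy = pd.DataFrame({
    'GMT': ['2014-06-03', '2014-06-04', '2014-06-05'],
    'Max TemperatureC': [19, 21, 20],
})

# Convert the date strings to proper date values.
toy['GMT'] = pd.to_datetime(toy['GMT'])

# Year, month, day: 4 June 2014.
target = datetime(2014, 6, 4)

# Keep only the rows whose date equals the target.
match = toy[toy['GMT'] == target]
print(match)
```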

Let’s try the function out by executing the code to ‘Return the row where the date is 4
June 2014’:

In [None]:
london[london['GMT'] == datetime(2014, 6, 4)] 

You can also now make more complex queries involving dates, such as 'Return all the rows where the date is between 8 December and 12 December':

In [None]:
dates = london['GMT']
start = datetime(2014, 12, 8)
end = datetime(2014, 12, 12)
london[(dates >= start) & (dates <= end)]

### Tasks

Now that the wind direction is given by a number, write code to select all days that had a northerly wind. Hint: select the rows where the direction is greater than or equal to 350 **or** smaller than or equal to 10, as the compass rose shows.

In the code cell below, write code to get and display all the rows in the dataframe that are between 1 April 2014 and
11 April 2014.

In the cell below, write two lines of code to display the first five rows that have a missing value in the `'Max Gust SpeedKm/h'` column. Hint: first select the missing value rows and store them in a new dataframe, then display the first five rows of the new dataframe.