# Python and Data
This assignment will introduce you to using python and pandas inside a jupyter notebook for data exploration and manipulation.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')

1. Write the required python code to load the data file `temperature.csv` into a pandas dataframe.

In [None]:
# import pandas as pd
df = pd.read_csv('temperature.csv')

2. Write the required python code to display the first 10 rows of your dataframe.

In [None]:
df.head(10)
# print(df[0:10])

3. Write the required python code to `describe` your dataframe. This will show the count, mean, standard deviation, minimum and maximum values, and the percentiles (0.25, 0.5, 0.75) for the temperature column.

In [None]:
cnt = df.temperature.count()  # non-null count, matching what describe() reports
mn = df.temperature.mean()
std = df.temperature.std()
low = df.temperature.min()
high = df.temperature.max()
q25 = df.temperature.quantile(q=0.25)
q5 = df.temperature.quantile(q=0.5)
q75 = df.temperature.quantile(q=0.75)
print("Number of data points: \t%s" %cnt)
print("Minimum temperature: \t%s" %low)
print("Maximum temperature: \t%s" %high)
print("Average temperature: \t%s +- %s" %(mn,std))
print("25th percentile: \t%s" %q25)
print("50th percentile: \t%s" %q5)
print("75th percentile: \t%s" %q75)

df.describe()
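
As a quick check of what `describe()` reports, here is a minimal sketch on a small made-up column. The five values and the name `sample` are assumptions for illustration only and do not come from `temperature.csv`.

```python
import pandas as pd

# Hypothetical readings; the real values come from temperature.csv.
sample = pd.DataFrame({"temperature": [20.0, 22.0, 24.0, 26.0, 28.0]})

# describe() on a single column returns a Series indexed by statistic name.
summary = sample["temperature"].describe()
print(summary)

# Each entry can be reproduced with the corresponding individual call.
manual_mean = sample["temperature"].mean()
manual_q25 = sample["temperature"].quantile(0.25)
manual_std = sample["temperature"].std()
```

The entries labelled `25%`, `50%` and `75%` in the summary correspond to `quantile(0.25)`, `quantile(0.5)` and `quantile(0.75)`.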

4. Write the required python code to rename the "temperature" column to "Tc". 

In [None]:
df.rename(columns={'temperature':'Tc'},inplace=True)
df.head()

5. Write the required python code to add a new column to the dataframe that represents the equivalent temperature value in Fahrenheit. You will now have three columns in your dataframe: the date, the Celsius value (Tc), and the Fahrenheit value (Tf).

In [None]:
df['Tf'] = df['Tc'] * (9/5)+32
df.head()
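
The conversion used above is Tf = Tc × 9/5 + 32. The sketch below applies it to three well-known reference points (freezing, body temperature, boiling) as a sanity check; the frame `temps` is a made-up example, not the assignment data.

```python
import pandas as pd

# Hypothetical Celsius values chosen because their Fahrenheit equivalents are well known.
temps = pd.DataFrame({"Tc": [0.0, 37.0, 100.0]})

# Vectorised conversion: the arithmetic is applied to every row at once.
temps["Tf"] = temps["Tc"] * 9 / 5 + 32
print(temps)
```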

6. Given the values of temperature, what can you say about the likely source of this data? Use the cell below to enter your answer as text (you will need to convert the cell from code to markdown).

In [None]:
df.date = pd.to_datetime(df.date)
# df.plot creates its own figure, so no separate plt.figure() call is needed.
df.plot('date','Tf',style='o',ms=.01)
plt.xticks(rotation=90)
plt.show()

The data is likely from Death Valley or a similarly hot location.
Average summer temperatures there are ~116 F, according to the link below.
https://tinyurl.com/y72h3but

7. Write the required python code to change the dataframe's index from a numeric index to a date index using the date column. Make sure you drop the date column once you have created the new index.

In [None]:
df.set_index('date',inplace=True,drop=True)
# df.drop('date',axis=1,inplace=True)

In [None]:
# df.groupby('nodeid').head(10)
df.head(10)

8. Write the required python code that will calculate the range of dates that this data covers.

In [None]:
mindate = df.index.min()
maxdate = df.index.max()
print("Date Range: %s to %s" %(mindate,maxdate))
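
With a datetime index in place, the span of the data can also be expressed as a duration by subtracting the two endpoints. This is a minimal sketch on made-up timestamps; the frame `readings` and its values are assumptions, not the assignment data.

```python
import pandas as pd

# Hypothetical readings; set_index drops the 'date' column by default.
readings = pd.DataFrame({
    "date": pd.to_datetime(["2017-03-01 00:00", "2017-03-02 12:00", "2017-03-04 06:00"]),
    "Tc": [21.0, 23.5, 22.0],
}).set_index("date")

first, last = readings.index.min(), readings.index.max()
span = last - first  # subtracting two Timestamps yields a pandas Timedelta
print(f"Date Range: {first} to {last} ({span})")
```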

9. Write the required python code to describe (count, mean, standard deviation, minimum and maximum value, as well as the percentiles (0.25, 0.5, 0.75)) grouped by `nodeid`.

In [None]:
# Summary statistics of the Fahrenheit column, one row per node.
grouped = df.groupby('nodeid')['Tf']
df2 = pd.DataFrame({
    "Count": grouped.count(),
    "Mean": grouped.mean(),
    "Std": grouped.std(),
    "Low": grouped.min(),
    "High": grouped.max(),
    "q25": grouped.quantile(0.25),
    "q50": grouped.quantile(0.5),
    "q75": grouped.quantile(0.75),
})
df2

# Equivalent built-in summary:
# df.groupby('nodeid')['Tf'].describe()
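
To illustrate the grouped summary on something small enough to verify by hand, here is a sketch with two made-up nodes. The node ids `a` and `b` and their readings are hypothetical.

```python
import pandas as pd

# Hypothetical readings for two nodes.
readings = pd.DataFrame({
    "nodeid": ["a", "a", "a", "b", "b"],
    "Tf": [70.0, 72.0, 74.0, 90.0, 94.0],
})

# describe() on a grouped column returns one row of statistics per group.
per_node = readings.groupby("nodeid")["Tf"].describe()
print(per_node)
```

Note that `len()` of a groupby object gives the number of groups, not the number of rows in each group, which is why the per-node count comes from `count()` or `describe()`.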


10. Pick a node (e.g. *nodeid* = 001e061146ba) and write the required python code to extract the data for that node from the larger dataframe, then display the last 10 items.

In [None]:
df3 = df.loc[df['nodeid'] == "001e061146ba"]
df3.tail(10)

11. Write the required python code to resample the data of the single node returning the mean (average) for each day.

In [None]:
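# Grouping by the raw index only merges identical timestamps; resample('D')
# bins the datetime index into calendar days before averaging.
df3[['Tc','Tf']].resample('D').mean()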

12. How many days' worth of data is produced in the resampling? (You will need to convert the cell from code to markdown.)

13. Write the required python code to resample the data of the single node returning the mean for each hour, display the first 24 values.
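
One possible approach to the daily and hourly resampling, sketched on synthetic data since no solution cell is given here. The frame `node`, its half-hourly timestamps and its values are assumptions; the rule `"60min"` is used for hourly bins, and the number of days produced is simply the length of the daily result.

```python
import pandas as pd

# Hypothetical single-node readings every 30 minutes over two days (96 rows).
index = pd.date_range("2017-03-01", periods=96, freq="30min")
node = pd.DataFrame({"Tc": [float(i) for i in range(96)]}, index=index)

# Daily mean: one row per calendar day covered by the index.
daily = node.resample("D").mean()
print(len(daily), "days of data")

# Hourly mean: one row per hour; show the first 24 hourly values.
hourly = node.resample("60min").mean()
print(hourly.head(24))
```

Resampling requires a datetime index, so on the assignment data this relies on the date column having been converted with `pd.to_datetime` before it became the index.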

---
# Switching Dataset (London Air - www.londonair.org.uk)
Exploration of data that has potential problems.
BG1 == Barking and Dagenham - Rush Green
BQ7 == Bexley - Belvedere West
Date: January 01, 2017 - February 01, 2017

This data represents rainfall measurements at two locations within London and is measured in millimeters (mm).

I have **modified** this data for the purpose of learning (that is, I introduced errors).

14. Write the required python code to load the new dataset `london-rain.csv` for exploration. Then, using the `describe()` function, what can you say about the data? (Enter your written answer in the cell below your code; you will need to add a cell and change the cell type to markdown.)

In [None]:
dflond = pd.read_csv('london-rain.csv')
dflond.describe()

In [None]:
dflond.head()

15. You should have noticed that there are no headers associated with the data. Write the required python code to add appropriate headers to the data (site, variable, date, value, units, type) and display the result of the `describe` function.

In [None]:
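# Reload with header=None so the first data row is not consumed as a header,
# then supply the column names explicitly.
dflond = pd.read_csv('london-rain.csv', header=None,
                     names=["site","variable","date","value","units","type"])
dflond.describe()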

In [None]:
dflond.head()
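
The effect of a missing header line can be seen on a tiny in-memory file. In this sketch the two rows of text are made up (only the site codes BG1 and BQ7 come from the description above), and `io.StringIO` stands in for the real CSV file.

```python
import io
import pandas as pd

# Hypothetical two-row file with no header line; the values are invented.
text = (
    "BG1,Rain,2017-01-01 00:00,0.2,mm,Ratified\n"
    "BQ7,Rain,2017-01-01 00:00,0.0,mm,Ratified\n"
)
names = ["site", "variable", "date", "value", "units", "type"]

# Default behaviour: the first data row is consumed as column names.
default = pd.read_csv(io.StringIO(text))

# With header=None and names=..., every line is kept as data.
rain = pd.read_csv(io.StringIO(text), header=None, names=names)
print(len(default), len(rain))
```

Assigning to `.columns` after a default load renames the columns but does not bring back the row that was read as the header.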

16. What is the average value of rainfall, and does this value seem reasonable? (Enter your written answer in the cell below your code; you will need to change the cell type to markdown.)

In [None]:
print(dflond.value.mean(), dflond.value.std())

The mean amount of rainfall is ~1.67 mm.
This is close to the average rainfall of London via https://tinyurl.com/y75fsucx
This website states London receives 583.6 mm per year on average.
The calculated average would yield 626.25 mm per year, which is about one standard deviation away.
Based on this information, this is a perfectly reasonable calculated mean.

17. What is the minimum value of rainfall, and does this value seem reasonable? (Enter your written answer in the cell below your code; you will need to change the cell type to markdown.)

In [None]:
print(dflond.value.min())

No, it is physically impossible to rain -1 mm.

18. What is the maximum value of rainfall, and does this value seem reasonable? (Enter your written answer in the cell below your code; you will need to change the cell type to markdown.)

In [None]:
print(dflond.value.max())

No, this value is ridiculous. 
If London received 2 meters of rainfall in one day, it would cause mass havoc.

19. Write the required python code to discover if there are any missing values in the rain dataset.

In [None]:
dflond.isnull().values.any()

20. How many rows have missing values? You will need two cells, one for required python code and a second one converted to markdown to write your answer.

In [None]:
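print(dflond.isnull().sum())              # missing values per column
print(dflond.isnull().any(axis=1).sum())  # number of rows with at least one missing value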

There are three rows with NaN values.
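
Counting missing cells and counting rows with missing cells can give different numbers, because one row may have several gaps. A minimal sketch on a made-up frame (the name `rain` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is missing 'value'; row 2 is missing both fields.
rain = pd.DataFrame({
    "site": ["BG1", "BG1", None, "BQ7"],
    "value": [0.2, np.nan, np.nan, 0.0],
})

missing_per_column = rain.isnull().sum()             # one count per column
rows_with_missing = rain.isnull().any(axis=1).sum()  # rows with at least one NaN
cleaned = rain.dropna()                              # drops every such row

print(missing_per_column)
print(rows_with_missing, len(cleaned))
```

Here three cells are missing in total but only two rows are affected, and `dropna()` removes exactly those two rows.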

21. Write the required python code to remove the rows with missing values.

In [None]:
dflond.dropna(inplace=True)
dflond.isnull().sum()

22. Write the required python code to convert the index from a simple number to the date column, and remove the date column.

In [None]:
dflond.set_index('date',inplace=True,drop=True)

In [None]:
dflond.head()

23. Write the required python code to write out new cleaned dataset with the file name "london-rain-clean.csv".

In [None]:
dflond.to_csv('london-rain-clean.csv')