<a href="https://colab.research.google.com/github/bjentwistle/PythonFundamentals/blob/main/Learn_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Python code to change the data for plotting
---

Let's use a practice dataset to learn how to prepare data for plotting.

Now that you have visually checked your dataset, you should be familiar with it and have an idea of what you could do with the data.

Let’s use Python Pandas to work with the data.

## Step 1 - Link to the dataset 
The instruction for linking to a dataset differs slightly depending on the type of file you are connecting to.  The datasets we are working with are **.csv** files (comma-separated values).

In the code box below, you already have a variable set up called *url* with a link to the dataset (NO2 Practice Data).  This has a small set of data for you to practice with.

Your code will need to:

*  import the pandas library (`import pandas as pd`)
*  store the link to the data file (`url = "...."`)
*  create a dataframe (a data table) called df (`df = pd.read_csv(url)`; the cell below also passes `skiprows=1`, which tells pandas to skip the first line of the file)

**Run the code cell below by clicking on the arrow in the top left of the cell.**
 


In [1]:
import pandas as pd

url = "https://drive.google.com/uc?id=1_arR3BXakHSMSUkBfPkRleXBX1HZcJ5F"
df = pd.read_csv(url, skiprows=1)
print(df)

          Date      Time    Day NO2 Level
0   01/01/2020   8:00:00    Mon      28.5
1   02/01/2020   8:00:00   Tues      17.5
2   03/01/2020   8:00:00    Wed      23.5
3   04/01/2020   8:00:00  Thurs     28.23
4   05/01/2020   8:00:00    Fri     24.53
..         ...       ...    ...       ...
86  27/03/2020   8:00:00    Wed      28.3
87  28/03/2020   8:00:00  Thurs     19.45
88  29/03/2020   8:00:00    Fri     21.04
89  30/03/2020  12:00:00    Sat      4.23
90  31/03/2020  12:00:00    Sun      1.59

[91 rows x 4 columns]
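
If you want to see what `pd.read_csv` does without relying on the download link, the sketch below reads a few rows from a piece of text instead of a URL. The three rows are copied from the output above; the names `sample_csv` and `sample_df` are made up for this illustration and are not part of the worksheet.

```python
import io
import pandas as pd

# A few rows copied from the output above, written as CSV text.
# (sample_csv and sample_df are illustrative names only.)
sample_csv = """Date,Time,Day,NO2 Level
01/01/2020,8:00:00,Mon,28.5
02/01/2020,8:00:00,Tues,17.5
03/01/2020,8:00:00,Wed,23.5
"""

# io.StringIO lets read_csv treat the text as if it were a file.
sample_df = pd.read_csv(io.StringIO(sample_csv))
print(sample_df)
```

The first line of the text becomes the column headings and every other line becomes one row of the dataframe.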


## Step 2 - Check the contents of your dataset

Now you are linked to your dataset, there are a few different ways you could check your data:
 
`display (df)` - displays the dataset (if the dataset is large, only the first and last 5 rows are shown).  

Check that the data has been read correctly:  

*  are there the same number of rows?  (0 to 90)  
*  are the columns the same as in the original file? (Date, Time, Day, NO2 Level)  
*  does it have any missing values?  

Run the code cell below to look at the data.  

In [2]:
display (df)

| | Date | Time | Day | NO2 Level |
|---|---|---|---|---|
| 0 | 01/01/2020 | 8:00:00 | Mon | 28.5 |
| 1 | 02/01/2020 | 8:00:00 | Tues | 17.5 |
| 2 | 03/01/2020 | 8:00:00 | Wed | 23.5 |
| 3 | 04/01/2020 | 8:00:00 | Thurs | 28.23 |
| 4 | 05/01/2020 | 8:00:00 | Fri | 24.53 |
| ... | ... | ... | ... | ... |
| 86 | 27/03/2020 | 8:00:00 | Wed | 28.3 |
| 87 | 28/03/2020 | 8:00:00 | Thurs | 19.45 |
| 88 | 29/03/2020 | 8:00:00 | Fri | 21.04 |
| 89 | 30/03/2020 | 12:00:00 | Sat | 4.23 |
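
The three checks listed above can also be done in code rather than by eye. This is a minimal sketch on a small made-up table (`check_df`, with a 'nodata' entry placed in the second row purely for illustration); on the real data you would use `df` instead.

```python
import pandas as pd

# Small made-up table in the same shape as the practice data.
# The position of 'nodata' here is an assumption for the example.
check_df = pd.DataFrame({
    "Date": ["01/01/2020", "02/01/2020", "03/01/2020"],
    "Time": ["8:00:00", "8:00:00", "8:00:00"],
    "Day": ["Mon", "Tues", "Wed"],
    "NO2 Level": ["28.5", "nodata", "23.5"],
})

# How many rows and columns are there?
n_rows, n_cols = check_df.shape

# Are the column headings the ones you expect?
column_names = list(check_df.columns)

# How many empty cells are there in each column?
missing_per_column = check_df.isnull().sum()

# How many rows use the text 'nodata' instead of a number?
nodata_count = (check_df["NO2 Level"] == "nodata").sum()

print(n_rows, n_cols)
print(column_names)
print(missing_per_column)
print(nodata_count)
```

Note that `isnull()` only counts cells that are truly empty. A placeholder word such as 'nodata' is ordinary text, so it has to be counted separately.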


## Step 3 - Check the structure of the dataset

`display (df.info())` - displays the structure of your dataset.  (`df.info()` prints the summary itself and returns `None`, which is why `None` appears at the end of the output.)
 
You should:  
*  check what types of data are in the dataset (whole numbers - int, decimal numbers - float, strings, dates, etc).  **dtypes** is short for datatypes and `df.info()` will show you what type of data is in each column.

A **.csv** file is plain text, so pandas has to work out a type for each column when it reads the file.  Dates are not converted automatically, and a column that contains any text (such as 'nodata') is read entirely as text (shown as an object dtype).  Is this right for data that is a date or a number?   

Run the code cell below to see information about the dataset.  You will see that all four columns are 'object' and in this case, this means that they are strings (text).

In [3]:
display (df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91 entries, 0 to 90
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Date       91 non-null     object
 1   Time       91 non-null     object
 2   Day        91 non-null     object
 3   NO2 Level  91 non-null     object
dtypes: object(4)
memory usage: 3.0+ KB


None
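
To see why a column of numbers can end up as text, here is a small sketch comparing two made-up one-column files: one containing only numbers and one with a 'nodata' entry. The names `clean` and `mixed` are illustrative only.

```python
import io
import pandas as pd

# Two tiny made-up CSV files held as text.
clean_csv = "NO2 Level\n28.5\n17.5\n23.5\n"
mixed_csv = "NO2 Level\n28.5\nnodata\n23.5\n"

clean = pd.read_csv(io.StringIO(clean_csv))
mixed = pd.read_csv(io.StringIO(mixed_csv))

# Every value is a number, so pandas reads the column as decimals.
print(clean["NO2 Level"].dtype)

# One value is a word, so pandas keeps the whole column as text.
print(mixed["NO2 Level"].dtype)
```

A single non-numeric entry is enough to stop pandas treating the column as numbers, which is why those rows need dealing with before the column can be converted.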

## Step 3 - Drop missing values

Sometimes we might substitute missing values, but only if it is relevant to do so.  Here we will drop them instead.  

To drop values from the results use the following:  
`df.drop(df[df['column heading'] == 'value to drop'].index, inplace = True)`

This says:  

`df.drop(...)` --> drop all rows that fit the criteria in the brackets  

`df[df['column heading'] == 'value to drop'].index` --> find all rows where the 'column heading' column contains the value to drop and get their index labels  

`inplace = True` --> make the change to the dataframe (df) itself, rather than returning a changed copy and leaving df as it was

Edit and run the cell below to drop all the rows where the data in the '*NO2 Level*' column is '*nodata*'  (replace the red writing with the column heading and the value to drop)


In [5]:
df.drop(df[df['NO2 Level'] == 'nodata'].index, inplace = True)
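
The one-line instruction above does three things at once. As a rough sketch, here are the same three steps written out separately on a small made-up table (`drop_df`, `is_nodata` and `rows_to_drop` are names invented for this example):

```python
import pandas as pd

# Made-up table with one 'nodata' entry, for illustration only.
drop_df = pd.DataFrame({
    "Day": ["Mon", "Tues", "Wed", "Thurs"],
    "NO2 Level": ["28.5", "nodata", "23.5", "28.23"],
})

# Step 1: True or False for each row - is the value 'nodata'?
is_nodata = drop_df["NO2 Level"] == "nodata"

# Step 2: the index labels of the rows where the answer was True.
rows_to_drop = drop_df[is_nodata].index

# Step 3: remove those rows from drop_df itself.
drop_df.drop(rows_to_drop, inplace=True)

print(list(rows_to_drop))
print(drop_df)
```

The remaining rows keep their original index labels, so the index now has a gap where the dropped row used to be. This is why a dataframe can report 90 entries while its index still runs from 0 to 90.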

## Step 4 - Change the Date data type

There are two columns where the data is text but, in order to create useful charts, it needs to be numbers or dates.

The 'Date' column can be changed from text to a date using the instruction below:  

`df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)`  

This converts the date from text (string) to a datetime record (which contains both date and time) and stores the new datetime back in the Date column.  The dates in this file are written day first (for example 27/03/2020), so `dayfirst=True` tells pandas to read them as day/month/year rather than month/day/year.

Run the code cell below to convert the date, and then see the new dtype using `df.info()`


In [6]:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
display(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90 entries, 0 to 90
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       90 non-null     datetime64[ns]
 1   Time       90 non-null     object        
 2   Day        90 non-null     object        
 3   NO2 Level  90 non-null     object        
dtypes: datetime64[ns](1), object(3)
memory usage: 6.0+ KB


None
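
The dates in this dataset are written with the day first (for example 27/03/2020), and a date such as 01/02/2020 could be read as either 1 February or 2 January. The sketch below, using two made-up date strings, shows two ways of telling pandas which layout to expect: the `format` argument spells the layout out exactly, and `dayfirst=True` is a shorter alternative.

```python
import pandas as pd

# Two made-up dates written as day/month/year text.
dates = pd.Series(["01/02/2020", "27/03/2020"])

# Option 1: spell out the layout (%d = day, %m = month, %Y = 4-digit year).
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

# Option 2: just say that the day comes before the month.
parsed_dayfirst = pd.to_datetime(dates, dayfirst=True)

# Check which part was read as the day and which as the month.
print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```

After converting real data it is worth looking at a few rows in the same way to confirm that the days and months have not been swapped.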

## Step 5 - Convert NO2 Level values to decimal numbers (float)

The 'NO2 Level' column can be changed from text to decimal numbers using the instruction below:   

`df['NO2 Level'] = df['NO2 Level'].astype(float)` 

This converts the NO2 Level values from text (string) to decimal numbers (float) and stores the new numbers back in the NO2 Level column.

Edit and run the code cell below to convert the values.  
Add the `df.info()` instruction to see the new dtype 

In [9]:
df['NO2 Level'] = df['NO2 Level'].astype(float)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90 entries, 0 to 90
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       90 non-null     datetime64[ns]
 1   Time       90 non-null     object        
 2   Day        90 non-null     object        
 3   NO2 Level  90 non-null     float64       
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 6.0+ KB
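
`astype(float)` fails with an error if any value in the column is not a number, which is why the 'nodata' rows were dropped first. As a hedged alternative sketch (not the method used in this worksheet), `pd.to_numeric` with `errors="coerce"` converts what it can and marks everything else as missing (NaN). The values below are made up for illustration.

```python
import pandas as pd

# Clean text values convert directly with astype(float).
clean_levels = pd.Series(["28.5", "17.5"]).astype(float)

# Made-up values that still include a 'nodata' entry.
levels = pd.Series(["28.5", "17.5", "nodata"])

# errors="coerce" turns anything that is not a number into NaN (missing).
as_numbers = pd.to_numeric(levels, errors="coerce")

print(clean_levels.tolist())
print(as_numbers.isna().tolist())
```

Either way, the column ends up as a float dtype and can then be used in calculations and charts.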


# Summary
---

Having worked through this, you should be familiar with running code in a code cell.  You may also have learnt what happens when you have a mistake in your code (red lines and/or error messages).

You will need most of the code above when you are making charts in the next worksheet.  Leave this worksheet open so that you can refer back, and copy instructions to other worksheets.