## Importing data
While you can create a dataframe in Python from scratch, it is much more common to import data from a file. There are many file types for storing data, but the one we will focus on in this unit is the comma-separated values (CSV) file. A CSV file can be imported into a dataframe using the Pandas <code>read_csv</code> function. An example of this is shown below (note: for this to work, the file 'BrisbaneDailyWeather.csv' must be uploaded to the same folder as your JupyterLab notebook).

In [1]:
import pandas as pd
weather = pd.read_csv('BrisbaneDailyWeather.csv')
weather

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall
0,2022/02/13,18.6,29.3,7.2
1,2022/02/12,20.4,28.9,0.0
2,2022/02/11,19.1,31.3,0.0
3,2022/02/10,19.4,31.2,0.0
4,2022/02/09,18.6,30.0,0.0
...,...,...,...,...
8096,1999/12/15,17.0,27.0,0.0
8097,1999/12/14,17.0,26.0,0.2
8098,1999/12/13,19.0,24.0,0.8
8099,1999/12/12,18.0,29.0,37.0
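
The <code>Unnamed: 0</code> column in the output above typically appears when a CSV file was saved with an unlabelled row-index column. The sketch below illustrates this on a small inline sample rather than the real file. The three rows of <code>csv_text</code> are a hypothetical stand-in, and it is an assumption that 'BrisbaneDailyWeather.csv' has this same layout.

```python
import io
import pandas as pd

# Hypothetical sample mimicking a CSV whose first column has no header
# (an assumption about how the weather file was saved).
csv_text = """,Date,MinTemp,MaxTemp,Rainfall
0,2022/02/13,18.6,29.3,7.2
1,2022/02/12,20.4,28.9,0.0
2,2022/02/11,19.1,31.3,0.0
"""

# Default import: the unlabelled column is kept as data and named 'Unnamed: 0'.
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 asks pandas to use the first column as the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

If the real file follows this layout, passing <code>index_col=0</code> to <code>read_csv</code> should remove the extra column.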


The <code>read_csv</code> function has one required input: the filepath of the file you wish to import. If you provide only this argument (as in the call above), Pandas will automatically make many decisions for you, such as the dataframe's index, the data type of each column and the column names. When you need more control over these choices, you can set the function's optional arguments, which are listed at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html. For example, when importing the pollution dataset, we may want to specify that the date column should be the dataframe index, and that the index should be parsed as datetimes. This can be requested with the code below.

In [2]:
pollution = pd.read_csv('LSTM-Multivariate_pollution.csv', index_col = 'date', parse_dates = True)
pollution

date,pollution,dew,temp,press,wnd_dir,wnd_spd,snow,rain
2/01/2010 0:00,129,-16,-4.0,1020.0,SE,1.79,0,0
2/01/2010 1:00,148,-15,-4.0,1020.0,SE,2.68,0,0
2/01/2010 2:00,159,-11,-5.0,1021.0,SE,3.57,0,0
2/01/2010 3:00,181,-7,-5.0,1022.0,SE,5.36,1,0
2/01/2010 4:00,138,-7,-5.0,1022.0,SE,6.25,2,0
...,...,...,...,...,...,...,...,...
31/12/2014 19:00,8,-23,-2.0,1034.0,NW,231.97,0,0
31/12/2014 20:00,10,-22,-3.0,1034.0,NW,237.78,0,0
31/12/2014 21:00,10,-22,-3.0,1034.0,NW,242.70,0,0
31/12/2014 22:00,8,-22,-4.0,1034.0,NW,246.72,0,0
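
The index values above are still displayed as day-first strings such as <code>31/12/2014 22:00</code>, which suggests they may not have been converted to datetimes. One possible remedy is to convert the index explicitly with <code>pd.to_datetime</code> and a format string. This is only a minimal sketch: the inline sample is hypothetical, and the format <code>'%d/%m/%Y %H:%M'</code> is an assumption based on the values shown in the output.

```python
import io
import pandas as pd

# Hypothetical sample using the same day-first date style as the output above.
csv_text = """date,pollution,temp
2/01/2010 0:00,129,-4.0
2/01/2010 1:00,148,-4.0
31/12/2014 22:00,8,-4.0
"""

# Import with the date column as the index, leaving it as text for now.
sample = pd.read_csv(io.StringIO(csv_text), index_col='date')

# Convert the index explicitly, stating that the day comes before the month.
sample.index = pd.to_datetime(sample.index, format='%d/%m/%Y %H:%M')

print(type(sample.index).__name__)
print(sample.index[0])
```

Stating the format explicitly avoids any ambiguity between day-first and month-first dates such as 2/01/2010.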


***
## Inspecting data
Datasets are often very large (10000+ observations!), so you need code to inspect the data rather than looking through the entire dataset manually. Some useful dataframe methods and attributes for this are summarised in the table below, where df represents the name of your dataframe.

| Method/attribute | Description |
| ----------- | ----------- |
| df.head(n = 5) | View the top n rows of the dataframe. |
| df.shape | Number of rows and columns. |
| df.columns | Column labels. |
| df.dtypes | Data type of each column. |
| df.index | Row labels. |
| df.info() | Summary of key information for dataframe. |
| df.describe() | Summary statistics for numeric columns. |
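
As a quick illustration of a few entries from the table, the sketch below builds a small dataframe by hand and inspects it. The column names and values are made up for this example and are not taken from the unit's datasets.

```python
import pandas as pd

# Small hypothetical dataframe used only to demonstrate the inspection tools.
df = pd.DataFrame({
    'temp': [-4.0, -4.0, -5.0],
    'wnd_dir': ['SE', 'SE', 'NW'],
    'snow': [0, 0, 1],
})

print(df.head(2))        # first two rows only
print(df.shape)          # tuple of (number of rows, number of columns)
print(list(df.columns))  # column labels as a plain list
print(df.dtypes)         # data type of each column
```

Note that <code>head</code> is a method and needs brackets, whereas <code>shape</code>, <code>columns</code> and <code>dtypes</code> are attributes and are written without them.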

In [3]:
# example: calling the info method
pollution.info()

<class 'pandas.core.frame.DataFrame'>
Index: 43800 entries, 2/01/2010 0:00 to 31/12/2014 23:00
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pollution  43800 non-null  int64  
 1   dew        43800 non-null  int64  
 2   temp       43800 non-null  float64
 3   press      43800 non-null  float64
 4   wnd_dir    43800 non-null  object 
 5   wnd_spd    43800 non-null  float64
 6   snow       43800 non-null  int64  
 7   rain       43800 non-null  int64  
dtypes: float64(3), int64(4), object(1)
memory usage: 3.0+ MB
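
While <code>info</code> reports the structure of a dataframe, <code>describe</code> reports summary statistics. The sketch below shows the idea on a small hypothetical dataframe (the values are made up for illustration). By default, only the numeric columns are summarised.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one text column.
df = pd.DataFrame({
    'temp': [-4.0, -4.0, -5.0, -5.0],
    'wnd_dir': ['SE', 'SE', 'NW', 'NW'],
    'snow': [0, 0, 1, 2],
})

# describe() returns a new dataframe with one row per statistic
# (count, mean, std, min, quartiles, max) and one column per numeric column.
summary = df.describe()

print(summary)
print(summary.loc['mean', 'temp'])  # individual statistics can be looked up by label
```

Because the result is itself a dataframe, it can be indexed, saved or plotted like any other.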
