# __Tutorial 1__

# Reading data into Python

The data you will analyze in this course will generally be provided in CSV format. The easiest way to read and work with CSV data in Python is via the [pandas library](https://pandas.pydata.org/). Pandas provides convenient data structures for manipulating and analyzing data and is built on NumPy (a Python package for working with arrays). Consider taking a look at the [10 minute intro to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html).

# Use the Python Pandas library for data manipulation and analysis
## Reading data into Python with pandas

We follow the convention of importing Pandas with the pd alias.

In [1]:
import numpy as np
import pandas as pd

With Pandas you can use the read_csv(path_to_file) function to automatically read and parse your CSV file. The only required argument is the path to the CSV file as a string or path object. The file can be stored locally on your computer or hosted online. Documentation for this function can be found [here](https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_csv.html).

In this example we'll use the board thickness dataset.

There are many ways to read the dataset. Use .read_csv() to read in the dataset and store it as a DataFrame object in a variable with a name of our choosing, here all_boards.
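As a minimal sketch of the "string or path object" point, the example below writes three rows copied from the board thickness data to a temporary folder and reads them back both ways. The temporary folder and the variable names (csv_text, boards_from_str, boards_from_path) are assumptions made for illustration, so that the example runs without the real file being present.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd

# Three rows copied from the board thickness data (illustration only).
csv_text = (
    "Date.Time,Pos1,Pos2,Pos3,Pos4,Pos5,Pos6\n"
    "2010-02-18 3:04,1761,1739,1758,1677,1684,1692\n"
    "2010-02-18 3:37,1801,1688,1753,1741,1692,1675\n"
    "2010-02-18 3:37,1697,1682,1663,1671,1685,1651\n"
)

with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "six-point-board-thickness.csv"
    csv_path.write_text(csv_text)

    # Path given as a string; os.path.join picks the right separator.
    boards_from_str = pd.read_csv(os.path.join(folder, "six-point-board-thickness.csv"))

    # Path given as a pathlib.Path object.
    boards_from_path = pd.read_csv(csv_path)

# Both calls parse the same file, so the two dataframes are identical.
print(boards_from_str.equals(boards_from_path))
print(boards_from_path.shape)
```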

## Read from os
The os module can be used when the data is in the current working directory (usually the folder that holds the Python file). os.getcwd() returns that directory, and the file name is joined onto it to form the full path.
```python
import os

all_boards = pd.read_csv(os.path.join(os.getcwd(), "six-point-board-thickness.csv"))
```

## Read from open
Copy the dataset directly into your working directory and read it by file name.
```python
all_boards = pd.read_csv('six-point-board-thickness.csv')
```

## Read from local computer
```python
all_boards = pd.read_csv('C:/Users/person/Desktop/4C03/six-point-board-thickness.csv')
```

## Read from online URL
You can download the file from a URL with the requests library, which will save the .csv to your working directory.
```python
import requests

download_url = "https://raw.githubusercontent.com/leebej/IBEHS4C03/main/ReadAndExploreData/six-point-board-thickness.csv?token=AWHGY67RONJF5BKT3HXZ6XDBPNB7K"
target_csv_path = "six-point-board-thickness.csv"
response = requests.get(download_url)
response.raise_for_status()    # Check that the request was successful
with open(target_csv_path, "wb") as f:
    f.write(response.content)
print("Download ready.")
all_boards = pd.read_csv(target_csv_path)
```

In [2]:
all_boards=pd.read_csv('six-point-board-thickness.csv')
print(all_boards)

             Date.Time  Pos1  Pos2  Pos3  Pos4  Pos5  Pos6
0      2010-02-18 3:04  1761  1739  1758  1677  1684  1692
1      2010-02-18 3:37  1801  1688  1753  1741  1692  1675
2      2010-02-18 3:37  1697  1682  1663  1671  1685  1651
3      2010-02-18 3:37  1679  1712  1672  1703  1683  1674
4      2010-02-18 3:37  1699  1688  1699  1678  1688  1705
...                ...   ...   ...   ...   ...   ...   ...
4995  2010-02-18 13:15  1690  1701  1690  1694  1735  1695
4996  2010-02-18 13:15  1703  1674  1666  1694  1659  1728
4997  2010-02-18 13:16  1657  1667  1675  1654  1648  1609
4998  2010-02-18 13:16  1746  1717  1638  1723  1703  1706
4999  2010-02-18 13:16  1668  1680  1668  1669  1651  1629

[5000 rows x 7 columns]


# Dataframes

The dataframe provides convenient built-in ways to query, manipulate, and analyze the dataset.  Like the dataset, the dataframe in this case has 5000 rows and 7 columns.    
The type() function returns the type of the object, which lets you check that it is a Pandas DataFrame.  
The len() function returns the number of rows.  
Three attributes describe the size of the dataframe:  
> The .shape attribute shows the dimensionality.  The result is a tuple containing the number of rows and columns.  
The .ndim attribute shows the number of dimensions of the dataframe.  
The .size attribute shows the total number of values.  
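A short sketch of these size checks, run on a five-row sample instead of the full file. The name sample_boards is an assumption for illustration; its values are the first five rows of the dataset shown above.

```python
import pandas as pd

# First five rows of the board thickness data (illustration only).
sample_boards = pd.DataFrame(
    {
        "Date.Time": ["2010-02-18 3:04", "2010-02-18 3:37", "2010-02-18 3:37",
                      "2010-02-18 3:37", "2010-02-18 3:37"],
        "Pos1": [1761, 1801, 1697, 1679, 1699],
        "Pos2": [1739, 1688, 1682, 1712, 1688],
        "Pos3": [1758, 1753, 1663, 1672, 1699],
        "Pos4": [1677, 1741, 1671, 1703, 1678],
        "Pos5": [1684, 1692, 1685, 1683, 1688],
        "Pos6": [1692, 1675, 1651, 1674, 1705],
    }
)

print(type(sample_boards))  # the pandas DataFrame class
print(len(sample_boards))   # number of rows: 5
print(sample_boards.shape)  # (rows, columns): (5, 7)
print(sample_boards.ndim)   # number of dimensions: 2
print(sample_boards.size)   # rows * columns: 35
```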

A dataframe has 3 components: the column names, the row labels, and the values. This arrangement is what makes a data matrix tidy, so you should first arrange, or tidy, your data into the form that you want.  
> The column names can be found with the .columns attribute.    
The .index attribute returns the row labels.  
The .values attribute returns the dataframe values.  You can also use the .to_numpy() method to create a 2D array of the values. 

Take a look at the first 5 rows of the dataframe with .head().
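The three components can be inspected as in the sketch below. The small sample_boards dataframe (three rows and three columns copied from the dataset) is an assumption used to keep the output short.

```python
import pandas as pd

# Three rows and three columns copied from the dataset (illustration only).
sample_boards = pd.DataFrame(
    {
        "Date.Time": ["2010-02-18 3:04", "2010-02-18 3:37", "2010-02-18 3:37"],
        "Pos1": [1761, 1801, 1697],
        "Pos2": [1739, 1688, 1682],
    }
)

column_names = list(sample_boards.columns)  # column labels
row_labels = list(sample_boards.index)      # row labels (0, 1, 2 by default)
values = sample_boards.to_numpy()           # 2D NumPy array of the values

print(column_names)
print(row_labels)
print(values.shape)
```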

In [17]:
print(type(all_boards))
len(all_boards)

all_boards.head()

<class 'pandas.core.frame.DataFrame'>


| | Date.Time | Pos1 | Pos2 | Pos3 | Pos4 | Pos5 | Pos6 |
|---|---|---|---|---|---|---|---|
| 0 | 2010-02-18 3:04 | 1761 | 1739 | 1758 | 1677 | 1684 | 1692 |
| 1 | 2010-02-18 3:37 | 1801 | 1688 | 1753 | 1741 | 1692 | 1675 |
| 2 | 2010-02-18 3:37 | 1697 | 1682 | 1663 | 1671 | 1685 | 1651 |
| 3 | 2010-02-18 3:37 | 1679 | 1712 | 1672 | 1703 | 1683 | 1674 |
| 4 | 2010-02-18 3:37 | 1699 | 1688 | 1699 | 1678 | 1688 | 1705 |


# Get to know the dataframe. 
You have imported a CSV file and had a first look at the data.  Now let's learn to examine the data systematically.  
First, take a look at the different data types that the dataframe contains.  Each column of the dataframe holds a specific data type.  Remember that a column of a dataframe is a Series object.  You can display all columns with their data types with .info().  
Or use the .dtypes attribute to return a Series object with column names as labels and the corresponding data types as values.  
Pandas uses the NumPy library to work with these data types.
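A minimal sketch of both approaches, assuming a small sample_boards dataframe built from rows of the dataset (the name is hypothetical). The exact type reported for the text column can differ between Pandas versions, so no output is shown here.

```python
import pandas as pd

# Three rows and three columns copied from the dataset (illustration only).
sample_boards = pd.DataFrame(
    {
        "Date.Time": ["2010-02-18 3:04", "2010-02-18 3:37", "2010-02-18 3:37"],
        "Pos1": [1761, 1801, 1697],
        "Pos2": [1739, 1688, 1682],
    }
)

# Prints a summary: column names, non-null counts, and data types.
sample_boards.info()

# Series with the column names as labels and the data types as values.
column_types = sample_boards.dtypes
print(column_types)

# A single column is a Series with its own dtype.
pos1 = sample_boards["Pos1"]
print(type(pos1), pos1.dtype)
```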

## Accessing elements and manipulating data

In this case, we're only interested in the board positions. The time the measurements were taken doesn't matter to us, so we can drop that column (Date.Time) from the dataframe using the [.drop() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html). Note that all of these data manipulation methods return another dataframe object, so the same methods are applicable to the transformed data as well.
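Dropping the column might look like the sketch below. The names sample_boards and positions are assumptions for illustration, and the data is a small sample copied from the dataset.

```python
import pandas as pd

# Three rows and three columns copied from the dataset (illustration only).
sample_boards = pd.DataFrame(
    {
        "Date.Time": ["2010-02-18 3:04", "2010-02-18 3:37", "2010-02-18 3:37"],
        "Pos1": [1761, 1801, 1697],
        "Pos2": [1739, 1688, 1682],
    }
)

# .drop() returns a new dataframe; sample_boards itself is left unchanged.
positions = sample_boards.drop(columns=["Date.Time"])

print(list(positions.columns))
print(list(sample_boards.columns))
```

Because the result is a new dataframe, assign it to a variable (or back to the original name) to keep working with it.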

Columns can be accessed by using their name in square brackets \[\], while rows can be accessed using their row (index) number and the [.iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) property. iloc works similarly to indexing a list, so the same type of slicing can be used: \[start:end\].
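A brief sketch of column and row access, using a hypothetical sample_boards dataframe with values copied from the first five rows of the dataset:

```python
import pandas as pd

# Two position columns from the first five rows (illustration only).
sample_boards = pd.DataFrame(
    {
        "Pos1": [1761, 1801, 1697, 1679, 1699],
        "Pos2": [1739, 1688, 1682, 1712, 1688],
    }
)

pos1 = sample_boards["Pos1"]           # one column, returned as a Series
first_row = sample_boards.iloc[0]      # one row, returned as a Series
middle_rows = sample_boards.iloc[1:3]  # rows 1 and 2; the end is excluded

print(pos1.tolist())
print(first_row["Pos2"])
print(middle_rows)
```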

You can use the .head() or .tail() methods to get a certain number of rows from the top (head) or bottom (tail) of the dataframe. Syntactically they're the same, so I'll only show an example of one.
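For instance, a sketch of .head() on a five-row sample (the sample_boards name is an assumption, with values copied from the dataset). Calling .tail(2) in the same way would return the last two rows instead.

```python
import pandas as pd

# Two position columns from the first five rows (illustration only).
sample_boards = pd.DataFrame(
    {
        "Pos1": [1761, 1801, 1697, 1679, 1699],
        "Pos2": [1739, 1688, 1682, 1712, 1688],
    }
)

# With no argument, .head() returns the first 5 rows; here we ask for 2.
top_two = sample_boards.head(2)
print(top_two)
```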

Row and column access can also be combined. The head and tail methods work here as well.
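A sketch of a few ways to combine the two, again on a hypothetical sample_boards dataframe with values copied from the dataset:

```python
import pandas as pd

# Three position columns from the first five rows (illustration only).
sample_boards = pd.DataFrame(
    {
        "Pos1": [1761, 1801, 1697, 1679, 1699],
        "Pos2": [1739, 1688, 1682, 1712, 1688],
        "Pos3": [1758, 1753, 1663, 1672, 1699],
    }
)

# Select a column by name, then slice its rows by position.
pos1_middle = sample_boards["Pos1"].iloc[1:3]

# Slice rows and columns by position in one step: rows 1-2, columns 0-1.
block = sample_boards.iloc[1:3, 0:2]

# .head() also works on a single column.
pos2_top = sample_boards["Pos2"].head(2)

print(pos1_middle.tolist())
print(block)
print(pos2_top.tolist())
```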

It is also possible to filter the dataframe based on the values in certain columns. Filtering with the [.loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) property returns a new dataframe containing only the rows that meet the condition.
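A minimal filtering sketch. The threshold of 1700 and the names sample_boards and thick_pos1 are assumptions chosen only for illustration; the values are copied from the first five rows of the dataset.

```python
import pandas as pd

# Two position columns from the first five rows (illustration only).
sample_boards = pd.DataFrame(
    {
        "Pos1": [1761, 1801, 1697, 1679, 1699],
        "Pos2": [1739, 1688, 1682, 1712, 1688],
    }
)

# The comparison produces a boolean Series with one True/False per row.
is_thick = sample_boards["Pos1"] > 1700

# .loc keeps only the rows where the boolean Series is True.
thick_pos1 = sample_boards.loc[is_thick]
print(thick_pos1)
```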

Separate conditions can be combined using boolean logic. In this case each condition needs to be written in round brackets, and element-wise symbols replace Python's and, or, and not keywords. The element-wise logical symbols for use in these statements are:
* and: &
* or: |
* not: ~ 
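The three operators could be used as in the sketch below. The thresholds and variable names are assumptions for illustration, applied to values copied from the first five rows of the dataset.

```python
import pandas as pd

# Three position columns from the first five rows (illustration only).
sample_boards = pd.DataFrame(
    {
        "Pos1": [1761, 1801, 1697, 1679, 1699],
        "Pos2": [1739, 1688, 1682, 1712, 1688],
        "Pos6": [1692, 1675, 1651, 1674, 1705],
    }
)

# and: both conditions must hold for a row to be kept.
both = sample_boards.loc[(sample_boards["Pos1"] > 1690) & (sample_boards["Pos2"] < 1700)]

# or: at least one condition must hold.
either = sample_boards.loc[(sample_boards["Pos1"] > 1750) | (sample_boards["Pos6"] > 1700)]

# not: keep the rows where the condition is False.
not_thick = sample_boards.loc[~(sample_boards["Pos1"] > 1700)]

print(list(both.index))
print(list(either.index))
print(list(not_thick.index))
```

The round brackets matter because & and | bind more tightly than comparison operators such as > and <.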

Alternatively, the .query() method can be used to succinctly query the dataframe.
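As a sketch, the same kind of combined filter can be written as a query string. The thresholds and names are assumptions for illustration, and the values are copied from the first five rows of the dataset.

```python
import pandas as pd

# Two position columns from the first five rows (illustration only).
sample_boards = pd.DataFrame(
    {
        "Pos1": [1761, 1801, 1697, 1679, 1699],
        "Pos2": [1739, 1688, 1682, 1712, 1688],
    }
)

# Inside the query string, column names are used directly and the
# plain words "and", "or", and "not" are allowed.
queried = sample_boards.query("Pos1 > 1690 and Pos2 < 1700")

# The equivalent filter written with .loc and a boolean mask.
masked = sample_boards.loc[(sample_boards["Pos1"] > 1690) & (sample_boards["Pos2"] < 1700)]

print(queried)
print(queried.equals(masked))
```

Column names that are not valid Python identifiers, such as Date.Time, need to be wrapped in backticks inside the query string.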