# Dataframes and data handling

Dataframes are tabular data structures with labeled axes (rows and columns), similar to a spreadsheet in Excel. They are widely used in data science through software libraries such as pandas in Python and the built-in `data.frame` in R.

Here we will use [pandas](https://pandas.pydata.org/).

First, we import two libraries: `pandas` and `random`.

In [15]:
import pandas as pd
import random

Next, let's generate the data for a simple example dataframe with ten rows and two columns of random values.

In [16]:
num_rows = 10
random_data_integer = [random.randint(0, 100) for i in range(num_rows)]
random_data_float = [random.uniform(0, 100) for i in range(num_rows)]

Then we create a **dictionary** (a data structure that maps keys to values of any type) that holds the two lists of data.

In [17]:
data = {'Random Integer': random_data_integer, 'Random Float': random_data_float}
data

{'Random Integer': [16, 81, 14, 73, 35, 60, 73, 9, 54, 59],
 'Random Float': [44.17359097614838,
  17.429419205459283,
  46.245183302692006,
  19.863547204363208,
  74.87940457529263,
  96.45444871184331,
  12.831316197790176,
  17.674797815512587,
  30.79320136269088,
  42.85239341643188]}

In the dictionary we now have two lists of numbers. 
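As a minimal sketch of how such a dictionary can be inspected before it is turned into a dataframe, the example below uses a shortened, hypothetical version of `data` (three hand-picked values per list instead of ten random ones).

```python
# Hypothetical, shortened example values; any lists of equal length would work.
data = {
    "Random Integer": [16, 81, 14],
    "Random Float": [44.2, 17.4, 46.2],
}

# The keys will later become the column names of the dataframe.
column_names = list(data.keys())

# Each key maps to one list of values.
first_integer = data["Random Integer"][0]

# All lists should have the same length; pandas raises an error otherwise.
lengths = {key: len(values) for key, values in data.items()}

print(column_names)
print(first_integer)
print(lengths)
```

Checking the lengths first is optional, but it can make a later error from `pd.DataFrame` easier to understand.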

With the `DataFrame` constructor from pandas (imported here under the abbreviation `pd`), we can convert the dictionary into a dataframe.

In [18]:
df = pd.DataFrame(data)
df

|   | Random Integer | Random Float |
|---|---|---|
| 0 | 16 | 44.173591 |
| 1 | 81 | 17.429419 |
| 2 | 14 | 46.245183 |
| 3 | 73 | 19.863547 |
| 4 | 35 | 74.879405 |
| 5 | 60 | 96.454449 |
| 6 | 73 | 12.831316 |
| 7 | 9 | 17.674798 |
| 8 | 54 | 30.793201 |
| 9 | 59 | 42.852393 |


In the output here we can now see the tabular structure of the dataframe. 
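The tabular structure can also be explored programmatically. The following sketch, which assumes a small hypothetical dataframe with three hand-picked rows, shows a few common ways to look at its shape, its columns, and individual rows.

```python
import pandas as pd

# Hypothetical, shortened example data (not the random values from above).
df = pd.DataFrame({
    "Random Integer": [16, 81, 14],
    "Random Float": [44.2, 17.4, 46.2],
})

# Number of rows and columns.
n_rows, n_cols = df.shape

# Column labels as a plain Python list.
columns = list(df.columns)

# Selecting one column by its label returns a pandas Series.
integers = df["Random Integer"]

# Selecting one row by its position with iloc.
first_row = df.iloc[0]

# head(n) shows only the first n rows, which is handy for large dataframes.
print(df.head(2))
```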

In addition, pandas includes many methods that can be applied directly to a dataframe:

`info()` prints a concise summary of the dataframe: the index range, the column names, the number of non-null values per column, the data types, and the memory usage.

`corr()` returns a correlation matrix with the pairwise correlation coefficients (Pearson's r by default) between the numeric columns. A high correlation (e.g., r > .8) indicates that two variables are strongly linearly related, whereas an r around zero indicates little or no linear relationship.

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Random Integer  10 non-null     int64  
 1   Random Float    10 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 288.0 bytes


In [24]:
df.corr()

|   | Random Integer | Random Float |
|---|---|---|
| Random Integer | 1.0 | -0.181792 |
| Random Float | -0.181792 | 1.0 |
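To make the interpretation of the correlation matrix more concrete, here is a small sketch with hypothetical columns whose relationship is known in advance: `double_x` grows with `x`, and `minus_x` shrinks as `x` grows. The column names and values are made up for illustration.

```python
import pandas as pd

x = [1, 2, 3, 4, 5]

# Hypothetical columns with a known linear relationship to x.
example = pd.DataFrame({
    "x": x,
    "double_x": [2 * value for value in x],
    "minus_x": [-value for value in x],
})

# Pairwise Pearson correlation between all numeric columns.
corr_matrix = example.corr()
print(corr_matrix)

# Single coefficients can be read out with .loc[row_label, column_label].
r_positive = corr_matrix.loc["x", "double_x"]
r_negative = corr_matrix.loc["x", "minus_x"]
```

Here `r_positive` should be (very close to) 1 and `r_negative` (very close to) -1, whereas two columns of independent random numbers, as in the example above, usually give a value much nearer to zero.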


**Exercise**: Create a dataframe with at least 3 columns. 

**Data handling**

Typically, a dataset is stored in a file. For example, one can export an Excel sheet to a `.csv` file (CSV: comma-separated values). Similarly, we can store our dataframe `df` in a `.csv` file with the `to_csv` method. The argument `index=False` omits the row index from the file.

In [26]:
df.to_csv("./datasets/data.csv", index=False)

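Now a new file (`data.csv`) that contains all values from the dataframe is stored in the `./datasets/` folder. Note that the folder must already exist, because `to_csv` does not create missing directories.

With the `read_csv()` function, we can read the file back in and store its contents directly in a new dataframe.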
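If the target folder might not exist yet, it can be created before writing. The sketch below is one possible way to do this with `pathlib`; it uses a temporary directory as a stand-in for the project folder and a shortened, hypothetical dataframe, so both are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical, shortened example data.
df = pd.DataFrame({"Random Integer": [16, 81], "Random Float": [44.2, 17.4]})

# A temporary directory stands in for the project folder (assumption).
base_dir = Path(tempfile.mkdtemp())
output_dir = base_dir / "datasets"

# Create the folder if it is missing; do nothing if it already exists.
output_dir.mkdir(parents=True, exist_ok=True)

output_path = output_dir / "data.csv"
df.to_csv(output_path, index=False)

# The file is plain text: a header line followed by one line per row.
csv_lines = output_path.read_text().splitlines()
print(csv_lines)
```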

In [27]:
df_read_from_file = pd.read_csv("./datasets/data.csv")
df_read_from_file

|   | Random Integer | Random Float |
|---|---|---|
| 0 | 16 | 44.173591 |
| 1 | 81 | 17.429419 |
| 2 | 14 | 46.245183 |
| 3 | 73 | 19.863547 |
| 4 | 35 | 74.879405 |
| 5 | 60 | 96.454449 |
| 6 | 73 | 12.831316 |
| 7 | 9 | 17.674798 |
| 8 | 54 | 30.793201 |
| 9 | 59 | 42.852393 |
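The following sketch illustrates why `index=False` was used when writing the file, and how one might check that a dataframe survives the round trip. It works on an in-memory string instead of a file and on a shortened, hypothetical dataframe; both are assumptions made to keep the example self-contained.

```python
import io

import pandas as pd

# Hypothetical, shortened example data.
df = pd.DataFrame({"Random Integer": [16, 81], "Random Float": [44.2, 17.4]})

# Without a path, to_csv returns the CSV content as a string.
csv_with_index = df.to_csv()
csv_without_index = df.to_csv(index=False)

# Reading the version with the index yields an extra, unnamed column.
with_index = pd.read_csv(io.StringIO(csv_with_index))
without_index = pd.read_csv(io.StringIO(csv_without_index))

print(list(with_index.columns))
print(list(without_index.columns))

# Raises an AssertionError if the two dataframes differ
# (floats are compared with a small tolerance by default).
pd.testing.assert_frame_equal(df, without_index)
```

When the index was written to the file, passing `index_col=0` to `read_csv` is a common way to restore it instead of getting an extra column.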


**Exercise**: Store your new dataframe in the file `new_random_data.csv` and load it again.