# Dataframes and data handling

Dataframes are tabular data structures with labeled axes (rows and columns, as in an Excel sheet). They are widely used in data science, for example through the Python library pandas or the built-in data frames in R (a familiar concept for R users).

Here we will use [pandas](https://pandas.pydata.org/docs/). A cheat sheet is available [here](http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3).

First, we will import three libraries: `pandas`, `random`, and `numpy`.

In [44]:
import pandas as pd
import random
import numpy as np

Next, let's prepare the data for a simple example data frame with 100 rows and two columns of random values: one of integers and one of floats.

In [45]:
num_rows = 100
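# 100 random integers between 0 and 100 (both ends inclusive)
random_data_integer = [random.randint(0, 100) for _ in range(num_rows)]
# 100 draws from a standard normal distribution (mean 0, standard deviation 1)
random_data_float = np.random.normal(loc=0, scale=1, size=num_rows)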

Then we create a **dictionary** (a data structure that maps keys to values of any type) that stores the two collections under descriptive keys.

In [46]:
data = {'Random Integer': random_data_integer, 'Random Float': random_data_float}
data

{'Random Integer': [90,
  5,
  66,
  8,
  22,
  90,
  18,
  72,
  50,
  31,
  79,
  63,
  26,
  51,
  67,
  43,
  60,
  48,
  33,
  72,
  76,
  97,
  18,
  53,
  31,
  39,
  95,
  56,
  51,
  43,
  92,
  97,
  76,
  40,
  62,
  27,
  26,
  83,
  84,
  44,
  93,
  93,
  39,
  17,
  99,
  89,
  18,
  59,
  47,
  25,
  22,
  99,
  8,
  21,
  30,
  100,
  74,
  49,
  96,
  17,
  73,
  34,
  31,
  87,
  63,
  62,
  74,
  64,
  81,
  67,
  79,
  71,
  99,
  3,
  27,
  85,
  0,
  9,
  1,
  19,
  23,
  19,
  23,
  63,
  39,
  6,
  42,
  50,
  79,
  51,
  35,
  29,
  65,
  25,
  78,
  75,
  50,
  33,
  17,
  78],
 'Random Float': array([-0.38583693,  1.1461526 ,  1.09659196, -1.38254705,  0.62956462,
         0.84407904,  1.50625559, -0.55132015, -1.52648934,  2.02323748,
         0.61256809,  0.50883438, -0.9497414 ,  1.33577742,  0.93359298,
        -1.36556855,  0.73874532,  0.33358013,  0.49032932,  0.78373047,
        -0.14443964,  0.22238854,  0.25919312, -0.72568249, -0.64898114,
       

The dictionary now holds two collections of numbers: a list of integers and a NumPy array of floats.
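To see how such a dictionary is organized without printing every value, one can inspect its keys and the type and length of each value. The following is a minimal sketch; the name `example_data` and the reduced size of 5 values are assumptions chosen for illustration, not part of the notebook above.

```python
import random

import numpy as np

# Hypothetical, smaller stand-in for the dictionary built above (5 values per key)
example_data = {
    "Random Integer": [random.randint(0, 100) for _ in range(5)],
    "Random Float": np.random.normal(loc=0, scale=1, size=5),
}

# The keys will later become the column names of the data frame
print(list(example_data.keys()))

# The values may have different types: here a Python list and a NumPy array
for key, values in example_data.items():
    print(key, type(values).__name__, len(values))
```

Both values have the same length, which is what allows them to serve as columns of one table.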

With the `DataFrame` constructor from pandas (imported here under the alias `pd`), we can convert the dictionary into a data frame.

In [47]:
df = pd.DataFrame(data)
df

|  | Random Integer | Random Float |
|---|---|---|
| 0 | 90 | -0.385837 |
| 1 | 5 | 1.146153 |
| 2 | 66 | 1.096592 |
| 3 | 8 | -1.382547 |
| 4 | 22 | 0.629565 |
| ... | ... | ... |
| 95 | 75 | -1.186135 |
| 96 | 50 | -0.776810 |
| 97 | 33 | -0.759147 |
| 98 | 17 | 0.765862 |


In the output here, we can now see the tabular structure of the data frame. 
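Beyond displaying the whole table, a few attributes and methods help to examine its structure. The sketch below rebuilds a comparable data frame under the assumed name `example_df`, using a seeded NumPy generator only so that the example is reproducible, and then looks at its shape, its column labels, its first rows, and a single column.

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption made only to keep this sketch reproducible
rng = np.random.default_rng(seed=0)

example_df = pd.DataFrame({
    "Random Integer": rng.integers(0, 101, size=100),  # upper bound is exclusive
    "Random Float": rng.normal(loc=0, scale=1, size=100),
})

# (number of rows, number of columns)
print(example_df.shape)

# Column labels, taken from the dictionary keys
print(example_df.columns.tolist())

# First three rows of the table
print(example_df.head(3))

# Selecting one column by its label returns a pandas Series
first_column = example_df["Random Integer"]
print(type(first_column).__name__)
```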

In addition, pandas provides many methods that can be applied directly to a data frame:

`info()` prints a concise summary of the data frame: the number of rows, the column names, the number of non-null values per column, the data types, and the memory usage.

`corr()` returns a correlation matrix with the pairwise correlations between the numeric columns (Pearson's r by default). A value of r close to 1 or -1 (for example, |r| > .8) indicates a strong linear relationship between two variables; an r around zero indicates little or no linear relationship.

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Random Integer  100 non-null    int64  
 1   Random Float    100 non-null    float64
dtypes: float64(1), int64(1)
memory usage: 1.7 KB


In [49]:
df.corr()

|  | Random Integer | Random Float |
|---|---|---|
| Random Integer | 1.0 | -0.038313 |
| Random Float | -0.038313 | 1.0 |
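The two random columns are independent, so their correlation is close to zero. To illustrate what a high correlation looks like by contrast, the following sketch builds a hypothetical data frame (the names `example`, `x`, `y`, and `z` are assumptions for illustration) in which `y` is a noisy linear function of `x`, while `z` is unrelated to both.

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption made only to keep this sketch reproducible
rng = np.random.default_rng(seed=42)

x = rng.normal(size=1000)
example = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=1000),  # follows x closely
    "z": rng.normal(size=1000),                     # independent of x and y
})

# Pairwise Pearson correlations between all numeric columns
corr_matrix = example.corr()
print(corr_matrix.round(2))
```

The entry for `x` and `y` should be close to 1, while the entries involving `z` should be close to 0. The diagonal is always 1, because every column correlates perfectly with itself.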


**Exercise**: Create a data frame with at least three columns. 

**Data handling**

Typically, a dataset is stored in a file. For example, one can export an Excel sheet to a `.csv` file (CSV: comma-separated values). Similarly, we can store our data frame `df` in a `.csv` file with the `to_csv()` method. The argument `index=False` omits the row index from the file.

In [50]:
df.to_csv("./datasets/data.csv", index=False)

Now a new file (`data.csv`) that includes all values from the data frame is stored in the `./datasets/` folder. Note that the folder must already exist, because `to_csv()` does not create it.
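If the target folder might be missing, it can be created before writing. The sketch below is one possible way to do this with `pathlib`; it writes into a temporary directory so that it does not touch the real `./datasets/` folder, and the names `example_df`, `target_dir`, and `target_file` are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small hypothetical data frame used only for this sketch
example_df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the folder (and any missing parents); no error if it already exists
    target_dir = Path(tmp_dir) / "datasets"
    target_dir.mkdir(parents=True, exist_ok=True)

    # Write the data frame without the row index
    target_file = target_dir / "data.csv"
    example_df.to_csv(target_file, index=False)

    # Read the raw file content back to see what was written
    file_exists = target_file.exists()
    csv_text = target_file.read_text()

print(file_exists)
print(csv_text)
```

The first line of the file contains the column names, followed by one line per row.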

With the `read_csv()` function, one can load the file again and store it directly in a data frame. 

In [51]:
df_read_from_file = pd.read_csv("./datasets/data.csv")
df_read_from_file

|  | Random Integer | Random Float |
|---|---|---|
| 0 | 90 | -0.385837 |
| 1 | 5 | 1.146153 |
| 2 | 66 | 1.096592 |
| 3 | 8 | -1.382547 |
| 4 | 22 | 0.629565 |
| ... | ... | ... |
| 95 | 75 | -1.186135 |
| 96 | 50 | -0.776810 |
| 97 | 33 | -0.759147 |
| 98 | 17 | 0.765862 |
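To check that nothing is lost when writing and reading a CSV file, the original and the reloaded data frame can be compared. The following sketch uses an in-memory buffer instead of a file on disk, which is an assumption made to keep the example self-contained; the names `original`, `restored`, and `restored_with_index` are likewise hypothetical. It also shows what happens when `index=False` is left out.

```python
import io

import numpy as np
import pandas as pd

# Seeded generator: an assumption made only to keep this sketch reproducible
rng = np.random.default_rng(seed=1)

original = pd.DataFrame({
    "Random Integer": rng.integers(0, 101, size=10),
    "Random Float": rng.normal(size=10),
})

# Without a path, to_csv() returns the CSV content as a string
csv_text = original.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))

# Raises an AssertionError if the two data frames differ
pd.testing.assert_frame_equal(original, restored)

# If the index is written as well, it comes back as an extra, unnamed column
csv_with_index = original.to_csv()
restored_with_index = pd.read_csv(io.StringIO(csv_with_index))
print(restored_with_index.columns.tolist())
```

In the second case, the reloaded data frame has three columns, the first of which pandas labels `Unnamed: 0`. This is why the index is usually omitted when it carries no information.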


**Exercise**: Store your new data frame in the file `new_random_data.csv` and load it again.