# Dataframes and data handling

Data frames are tabular data structures with labeled axes (rows and columns, as in an Excel spreadsheet). They are widely used in data science and are provided by libraries such as pandas in Python and by the built-in data frames in R (a reminder for readers familiar with R).

Here we will use [pandas](https://pandas.pydata.org/docs/). A cheat sheet is available [here](http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3).

First, we import three libraries: `pandas`, `random`, and `numpy`.

In [44]:
import pandas as pd
import random
import numpy as np

Next, let's define a simple example data frame with 1000 rows and two columns of random values.

In [79]:
num_rows = 1000
random_data_integer = [random.randint(100, 1000) for _ in range(num_rows)]  # list comprehension; both bounds are inclusive
random_data_float = np.random.normal(loc=0, scale=1, size=num_rows)  # standard normal: mean 0, standard deviation 1

In [80]:
len(random_data_float)

1000
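
Because both columns are drawn at random, the values differ on every run. The following is a minimal sketch of how the same data could be made repeatable, assuming fixed seeds. The seed value `42` and the use of `np.random.default_rng` are assumptions made for this illustration.

```python
import random

import numpy as np

# The seed value is arbitrary; any fixed integer gives repeatable draws.
random.seed(42)
rng = np.random.default_rng(42)

num_rows = 1000
# random.randint includes both endpoints, so values lie in [100, 1000].
random_data_integer = [random.randint(100, 1000) for _ in range(num_rows)]
# Draw from a standard normal distribution (mean 0, standard deviation 1).
random_data_float = rng.normal(loc=0, scale=1, size=num_rows)

print(len(random_data_integer), len(random_data_float))
```

Running the block twice should print the same lengths and produce the same values both times, which makes later outputs easier to compare.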

Then we create a **dictionary** (a data structure that maps keys to values of any type) that holds the two data sequences under descriptive keys.

In [81]:
data = {'Random Integer': random_data_integer, 'Random Float': random_data_float}
data

{'Random Integer': [445,
  625,
  593,
  873,
  924,
  ...

The dictionary now holds two sequences of numbers: a list of integers and a NumPy array of floats.
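
As a small illustration of how such a dictionary is used, the sketch below builds a miniature version and looks up values by key. The name `small_data` is hypothetical, and its values are the first three entries of the outputs in this section (floats rounded to two decimals).

```python
# Hypothetical miniature version of the `data` dictionary.
small_data = {
    "Random Integer": [445, 625, 593],
    "Random Float": [-0.49, 0.98, 0.16],
}

print(list(small_data.keys()))           # the keys will become column names
print(small_data["Random Integer"])      # look up one sequence by its key
print(len(small_data["Random Float"]))   # number of values stored under a key
```

Each key later becomes a column name, and each sequence becomes the values of that column, so all sequences should have the same length.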

With the `DataFrame` constructor from pandas (imported here under the alias `pd`), we can convert the dictionary into a data frame.

In [82]:
df = pd.DataFrame(data)
df

| | Random Integer | Random Float |
|---|---|---|
| 0 | 445 | -0.485877 |
| 1 | 625 | 0.976104 |
| 2 | 593 | 0.161145 |
| 3 | 873 | 0.613808 |
| 4 | 924 | 0.714427 |
| ... | ... | ... |
| 995 | 210 | -0.606710 |
| 996 | 1000 | -0.330303 |
| 997 | 398 | 0.391045 |
| 998 | 843 | 0.201583 |


In the output here, we can now see the tabular structure of the data frame. 
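
To make the tabular structure more concrete, here is a minimal sketch of a few common ways to inspect a data frame. The frame `small_df` is a hypothetical four-row example whose column names mirror the ones used in this section.

```python
import pandas as pd

# Hypothetical small frame with the same column names as in this section.
small_df = pd.DataFrame(
    {
        "Random Integer": [445, 625, 593, 873],
        "Random Float": [-0.49, 0.98, 0.16, 0.61],
    }
)

print(small_df.shape)               # (number of rows, number of columns)
print(small_df.head(2))             # the first two rows
print(small_df["Random Integer"])   # a single column, returned as a Series
print(small_df.loc[0])              # the row with index label 0
```

The leftmost, unnamed column in the printed output is the index, which pandas creates automatically as 0, 1, 2, ... when none is given.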

In addition, pandas includes many methods that can be applied directly to a data frame:

`info()` prints a concise summary of the data frame: the index range, the column names, the number of non-null values per column, the data types, and the memory usage.

`corr()` returns a correlation matrix with the pairwise correlation between numeric columns (Pearson's r by default), which measures how strongly two variables are linearly related (e.g., an r above .8 indicates a strong positive linear relationship; an r around zero indicates little or no linear relationship).

In [83]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Random Integer  1000 non-null   int64  
 1   Random Float    1000 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 15.8 KB


In [84]:
df.corr()

| | Random Integer | Random Float |
|---|---|---|
| Random Integer | 1.0 | 0.048654 |
| Random Float | 0.048654 | 1.0 |
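
The two random columns were generated independently, so a correlation near zero is what one would expect. For contrast, the sketch below compares a column that is an exact linear function of another with a column drawn independently. The column names and the seed are assumptions chosen for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
x = rng.normal(size=1000)

demo = pd.DataFrame(
    {
        "x": x,
        "linear in x": 2 * x + 1,              # exact linear function of x
        "independent": rng.normal(size=1000),  # drawn separately from x
    }
)

corr_matrix = demo.corr()  # Pearson correlation by default
print(corr_matrix.round(2))
```

The entry for `x` and `linear in x` should be 1 (up to floating-point rounding), while the entries involving `independent` should be close to zero. The diagonal is always 1 because every column is perfectly correlated with itself.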


**Exercise**: Create a data frame with at least three columns. 

**Data handling**

Typically, a dataset is stored in a file. For example, one can export an Excel sheet to a `.csv` file (CSV: comma-separated values). Similarly, we can store our data frame `df` in a `.csv` file with the `to_csv()` method. Passing `index=False` leaves the row index out of the file.

In [50]:
df.to_csv("./datasets/data.csv", index=False)

Now a new file (`data.csv`) that includes all values from the data frame is stored in the `./datasets/` folder.
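
To see what the `index` argument changes, the following sketch writes a tiny frame twice and compares the header lines of the two files. It writes to a temporary directory instead of `./datasets/`, and the file names and `small_df` are hypothetical.

```python
import os
import tempfile

import pandas as pd

small_df = pd.DataFrame({"Random Integer": [445, 625], "Random Float": [-0.49, 0.98]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    small_df.to_csv(with_index)                  # default: index written as the first column
    small_df.to_csv(without_index, index=False)  # index left out of the file

    with open(with_index) as f:
        header_with_index = f.readline().strip()
    with open(without_index) as f:
        header_without_index = f.readline().strip()

print(header_with_index)     # starts with a comma: the index column has no name
print(header_without_index)
```

Note that `to_csv` does not create missing folders, so the target folder has to exist before writing, for example by calling `os.makedirs("datasets", exist_ok=True)` first.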

With the `read_csv()` function, one can load the file again and store it directly in a data frame. 

In [51]:
df_read_from_file = pd.read_csv("./datasets/data.csv")
df_read_from_file

| | Random Integer | Random Float |
|---|---|---|
| 0 | 90 | -0.385837 |
| 1 | 5 | 1.146153 |
| 2 | 66 | 1.096592 |
| 3 | 8 | -1.382547 |
| 4 | 22 | 0.629565 |
| ... | ... | ... |
| 95 | 75 | -1.186135 |
| 96 | 50 | -0.776810 |
| 97 | 33 | -0.759147 |
| 98 | 17 | 0.765862 |
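
One way to check that nothing was lost when saving and loading is to compare the two data frames. The sketch below does this for a small hypothetical frame, using a temporary directory so that it does not depend on the `./datasets/` folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical three-row frame with the column names used in this section.
original = pd.DataFrame(
    {
        "Random Integer": [445, 625, 593],
        "Random Float": [-0.485877, 0.976104, 0.161145],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.csv")
    original.to_csv(path, index=False)
    restored = pd.read_csv(path)

# Raises an AssertionError if the frames differ (floats are compared with a tolerance).
pd.testing.assert_frame_equal(original, restored)
print(restored.dtypes)
```

`read_csv` infers the column types from the text in the file, so the integer and float columns should come back with the same types they had before saving.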


**Exercise**: Store your new data frame in the file `new_random_data.csv` and load it again.