# Dataframes and data handling

Dataframes are tabular data structures with labeled axes (rows and columns), much like a spreadsheet in Excel. They are widely used in data science through libraries such as pandas in Python and the built-in `data.frame` in R.

Here we will use [pandas](https://pandas.pydata.org/docs/). A cheat sheet is available [here](http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3).

First we will import three libraries: `pandas`, `random`, and `numpy`.

In [36]:
import pandas as pd
import random
import numpy as np

First, let's generate the data for a simple example dataframe with 100 rows and two columns of random values.

In [37]:
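num_rows = 100
# Random integers between 0 and 100 (both ends included)
random_data_integer = [random.randint(0, 100) for _ in range(num_rows)]
# Draws from a normal distribution with mean 0 and standard deviation 1
random_data_float = np.random.normal(loc=0, scale=1, size=num_rows)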
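
Because both columns are random, every run of the cell above produces different numbers. If repeatable values are wanted, the generators can be seeded. The following is a minimal sketch, not part of the original notebook: the seed value `42`, the small `num_rows`, and the names `ints_first`, `floats_first`, `ints_second`, and `floats_second` are assumptions chosen for illustration, and it uses NumPy's `default_rng` generator instead of the `np.random.normal` call above.

```python
import random

import numpy as np

num_rows = 5  # small on purpose, so the values are easy to inspect

# Seed both generators: `random` and NumPy keep separate internal state.
random.seed(42)
rng = np.random.default_rng(42)

ints_first = [random.randint(0, 100) for _ in range(num_rows)]
floats_first = rng.normal(loc=0, scale=1, size=num_rows)

# Re-seeding with the same value reproduces exactly the same draws.
random.seed(42)
rng = np.random.default_rng(42)

ints_second = [random.randint(0, 100) for _ in range(num_rows)]
floats_second = rng.normal(loc=0, scale=1, size=num_rows)

print(ints_first == ints_second)
print(np.array_equal(floats_first, floats_second))
```

Both comparisons should report that the two runs match.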

Then we create a **dictionary** (a data structure that maps keys to values of any type) that holds both sequences, using the future column names as keys.

In [38]:
data = {'Random Integer': random_data_integer, 'Random Float': random_data_float}
data

{'Random Integer': [12, 30, 74, 18, 14, 7, 8, 48, 90, 18,
  95, 63, 91, 49, 73, 34, 38, 55, 22, 21,
  80, 59, 16, 29, 98, 56, 13, 98, 63, 22,
  33, 0, 65, 72, 74, 36, 22, 85, 34, 36,
  21, 30, 76, 90, 11, 33, 67, 59, 54, 9,
  54, 4, 30, 28, 70, 84, 11, 87, 55, 97,
  71, 20, 72, 79, 66, 18, 24, 3, 51, 69,
  27, 17, 89, 36, 71, 50, 51, 97, 15, 55,
  65, 10, 69, 42, 80, 57, 63, 98, 30, 39,
  96, 1, 82, 75, 81, 100, 40, 24, 86, 99],
 'Random Float': array([ 1.73011685, -1.21550732, -0.21105616, -0.37548494, -0.34645501,
        -0.55530602, -0.52664169, -0.08106634,  0.02150589,  0.98968158,
        -1.98436589,  0.59126804,  0.56739495,  1.4761076 , -0.60251177,
         1.82048021,  1.77069305, -0.18062752,  1.277264  , -0.35035954,
         0.27346137, -0.43947682,  0.95346465, -0.75697987, -0.87954282,
      

The dictionary now holds two sequences of numbers: a list of integers and a NumPy array of floats.
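
To make the structure of such a dictionary concrete, here is a small sketch. The name `small_data` and its three-element values are hypothetical stand-ins for the full `data` dictionary, shortened so the result is easy to read.

```python
import numpy as np

# Hypothetical miniature version of the `data` dictionary from above.
small_data = {
    "Random Integer": [12, 30, 74],
    "Random Float": np.array([1.73, -1.22, -0.21]),
}

# Keys become the column names once the dictionary is passed to pandas.
print(list(small_data.keys()))

# Values are looked up by key; they may have different types ...
print(type(small_data["Random Integer"]).__name__)
print(type(small_data["Random Float"]).__name__)

# ... but they need the same length to form the columns of one table.
lengths = {key: len(values) for key, values in small_data.items()}
print(lengths)
```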

With the `DataFrame` constructor from pandas (imported here under the abbreviation `pd`), we can convert the dictionary into a dataframe.

In [39]:
df = pd.DataFrame(data)
df

| | Random Integer | Random Float |
| --- | --- | --- |
| 0 | 12 | 1.730117 |
| 1 | 30 | -1.215507 |
| 2 | 74 | -0.211056 |
| 3 | 18 | -0.375485 |
| 4 | 14 | -0.346455 |
| ... | ... | ... |
| 95 | 100 | 0.427353 |
| 96 | 40 | -0.224032 |
| 97 | 24 | -0.991580 |
| 98 | 86 | -1.154679 |


The output now shows the tabular structure of the dataframe.
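
Beyond displaying the whole table, a few attributes and accessors help to look at parts of it. This is a short sketch on a hypothetical three-row dataframe called `small_df`; the same calls should work on `df`.

```python
import pandas as pd

# Hypothetical three-row dataframe, standing in for `df`.
small_df = pd.DataFrame({
    "Random Integer": [12, 30, 74],
    "Random Float": [1.73, -1.22, -0.21],
})

print(small_df.shape)              # (number of rows, number of columns)
print(list(small_df.columns))      # column labels
print(small_df.head(2))            # first two rows
print(small_df["Random Integer"])  # a single column, returned as a Series
print(small_df.loc[0])             # the row with index label 0
```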

In addition, pandas includes many methods that can be applied directly to a dataframe:

`info()` prints a quick summary of the dataframe: the number of rows, the column names, the count of non-null values per column, the data types, and the memory usage.

`corr()` returns a correlation matrix that describes how strongly each pair of columns is linearly related (e.g., a correlation of r > .8 indicates that two variables are highly similar, an r around zero indicates little or no linear relationship, and an r close to -1 indicates a strong inverse relationship).

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Random Integer  100 non-null    int64  
 1   Random Float    100 non-null    float64
dtypes: float64(1), int64(1)
memory usage: 1.7 KB


In [41]:
df.corr()

| | Random Integer | Random Float |
| --- | --- | --- |
| Random Integer | 1.0 | -0.062587 |
| Random Float | -0.062587 | 1.0 |
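
The two random columns above are unrelated, so their correlation is close to zero. To see what high and low correlations look like side by side, here is a sketch with made-up data. The dataframe `demo` and its column names are hypothetical, and the seed is fixed only to make the run repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
x = rng.normal(size=1000)

demo = pd.DataFrame({
    "x": x,
    "x_plus_noise": x + rng.normal(scale=0.1, size=1000),  # nearly a copy of x
    "minus_x": -x,                                          # mirror image of x
    "unrelated": rng.normal(size=1000),                     # drawn independently of x
})

corr = demo.corr()  # Pearson correlation by default
print(corr.round(2))
```

In the printed matrix, `x` and `x_plus_noise` should correlate close to 1, `x` and `minus_x` close to -1, and `x` and `unrelated` close to 0. The diagonal is always 1 because every column correlates perfectly with itself.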


**Exercise**: Create a dataframe with at least 3 columns. 

**Data handling**

Typically, a dataset is stored in a file. For example, one can export an Excel sheet to a `.csv` file (CSV: comma-separated values). Similarly, we can store our dataframe `df` in a `.csv` file with the `to_csv` method.

In [42]:
df.to_csv("./datasets/data.csv", index=False)

Now a new file (`data.csv`) containing all values from the dataframe is stored in the `./datasets/` folder. Because of `index=False`, the row index is not written to the file.
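
Note that `to_csv` does not create missing folders, so the call fails if the target folder does not exist yet. One way to guard against this is sketched below. The use of `pathlib` and of a temporary directory are assumptions made for this illustration (so that the example does not touch the real project folder), and `small_df` is a hypothetical stand-in for `df`.

```python
from pathlib import Path
import tempfile

import pandas as pd

small_df = pd.DataFrame({"Random Integer": [12, 30, 74]})

# A temporary directory keeps this sketch away from the real project folder.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "datasets" / "data.csv"

    # Create the folder (and any missing parents) before writing.
    target.parent.mkdir(parents=True, exist_ok=True)

    small_df.to_csv(target, index=False)
    file_was_written = target.exists()

print(file_was_written)
```

In the notebook itself, the same `mkdir` call on `Path("./datasets")` before `df.to_csv(...)` would serve the same purpose.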

With the `read_csv()` function, one can read the file back and load it directly into a dataframe.

In [43]:
df_read_from_file = pd.read_csv("./datasets/data.csv")
df_read_from_file

| | Random Integer | Random Float |
| --- | --- | --- |
| 0 | 12 | 1.730117 |
| 1 | 30 | -1.215507 |
| 2 | 74 | -0.211056 |
| 3 | 18 | -0.375485 |
| 4 | 14 | -0.346455 |
| ... | ... | ... |
| 95 | 100 | 0.427353 |
| 96 | 40 | -0.224032 |
| 97 | 24 | -0.991580 |
| 98 | 86 | -1.154679 |
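
The effect of the `index=False` argument becomes visible when comparing both variants of the round trip. The sketch below writes to an in-memory string instead of a file, which is an assumption made to keep the example self-contained; `small_df` and the other variable names are hypothetical.

```python
import io

import pandas as pd

small_df = pd.DataFrame({
    "Random Integer": [12, 30, 74],
    "Random Float": [1.5, -1.25, -0.25],
})

# With index=False only the two data columns are written.
without_index = small_df.to_csv(index=False)
restored = pd.read_csv(io.StringIO(without_index))
print(list(restored.columns))
print(restored["Random Float"].tolist())

# With the default index=True the row labels are written as an unnamed first
# column, which read_csv then loads as an ordinary column.
with_index = small_df.to_csv()
restored_with_index = pd.read_csv(io.StringIO(with_index))
print(list(restored_with_index.columns))
```

The second variant ends up with an extra column named `Unnamed: 0`, which is why writing with `index=False` is usually the cleaner choice when the index carries no information.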


**Exercise**: Store your new dataframe in the file `new_random_data.csv` and load it again.