# Dataframes and data handling

Dataframes are tabular data structures with labeled axes (rows and columns, as in Excel). They are widely used in data science through libraries such as pandas in Python and the built-in data frames in R (which may be familiar to R users).

Here we will use [pandas](https://pandas.pydata.org/docs/). Find a cheat sheet [here](http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3) 

First, we will import three libraries: `pandas`, `random`, and `numpy` (NumPy is used below to draw the random numbers).

In [9]:

import pandas as pd
import random
import numpy as np

Next, let's build a simple example data frame with 1000 rows and two columns of normally distributed random values. We start by drawing the values for each column.

In [10]:
num_rows = 1000
random_data_1 = np.random.normal(loc=2, scale=1, size=num_rows)
random_data_2 = np.random.normal(loc=0, scale=1, size=num_rows)

In [11]:
len(random_data_1)

1000
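The values drawn above change on every run because no seed is set. As a minimal sketch, assuming NumPy's `default_rng` generator and an arbitrary seed of 42 (both are choices made here for illustration, not part of the original cells), the draws can be made reproducible like this:

```python
import numpy as np

num_rows = 1000

# A seeded generator returns the same numbers on every run (seed 42 is arbitrary)
rng = np.random.default_rng(seed=42)
seeded_data_1 = rng.normal(loc=2, scale=1, size=num_rows)  # mean 2, SD 1
seeded_data_2 = rng.normal(loc=0, scale=1, size=num_rows)  # mean 0, SD 1

# The sample means should land close to the requested `loc` values
print(seeded_data_1.mean(), seeded_data_2.mean())
```

Re-running the block with the same seed should give identical arrays, which makes results easier to compare across sessions.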

Then we create a **dictionary** (a data structure that maps keys to values of any type) that stores the two arrays under descriptive names.

In [12]:
data = {'Random variable 1': random_data_1,
        'Random variable 2': random_data_2
        }
data

{'Random variable 1': array([ 2.29288274e+00,  1.75875209e+00,  8.14931901e-01,  8.73689701e-01,
         3.16103782e+00,  1.34850295e+00,  1.21622737e+00,  1.35281520e+00,
         1.76700469e+00,  2.95409821e+00,  2.19110972e+00,  3.41054722e+00,
         1.74297946e+00,  8.96502608e-01,  1.31555486e+00,  2.47607452e+00,
         2.63613646e+00,  7.83982877e-01,  8.27665536e-01,  1.13944009e+00,
         2.44655144e+00,  1.34962906e+00,  2.01206503e+00,  2.22240104e+00,
         1.50311636e+00,  2.93404302e+00,  1.77937248e+00,  2.56970044e+00,
         2.11194644e+00,  8.14692551e-01,  3.29612896e+00,  2.78211717e+00,
         2.01230829e+00,  2.60168950e+00,  3.87354665e+00,  3.77586931e+00,
         2.09693463e+00,  7.19656251e-01,  1.12356808e+00,  7.65822665e-01,
         1.36869078e+00,  1.62093744e+00,  9.95829410e-01,  2.73898051e+00,
         2.43035790e+00,  1.88564701e+00,  2.39836056e+00,  8.88730224e-01,
         3.88373674e+00,  1.48716449e+00,  2.14908263e+00,  3.07173

The dictionary now holds two NumPy arrays of numbers, one per key.

With the `DataFrame` constructor from pandas (imported above under the alias `pd`), we can convert the dictionary into a data frame. Each key becomes a column name and each array becomes the values of that column.

In [14]:
df = pd.DataFrame(data)
df

| | Random variable 1 | Random variable 2 |
|---|---|---|
| 0 | 2.292883 | 0.496205 |
| 1 | 1.758752 | -1.239064 |
| 2 | 0.814932 | 0.830087 |
| 3 | 0.873690 | -0.209762 |
| 4 | 3.161038 | -1.321978 |
| ... | ... | ... |
| 995 | 3.207313 | 0.487692 |
| 996 | 2.242479 | -0.398073 |
| 997 | 2.074491 | -1.533468 |
| 998 | 1.917045 | -0.320334 |


In the output here, we can now see the tabular structure of the data frame. 
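To make the mapping from dictionary keys to columns concrete, here is a small sketch with a three-row toy dictionary. The column names and values are invented for illustration only.

```python
import pandas as pd

# Hypothetical toy data: each key becomes a column, each list its values
toy_data = {"height_cm": [170, 182, 165], "weight_kg": [65.0, 80.5, 58.2]}
toy_df = pd.DataFrame(toy_data)

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # column labels come from the dictionary keys
print(toy_df["height_cm"])   # selecting one column returns a pandas Series
print(toy_df.head(2))        # first two rows only
```

For large frames such as `df`, `head()` is usually more convenient than printing the whole object.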

In addition, pandas includes many methods that can be called directly on a data frame:

`info()` prints a concise summary of the data frame: the index range, the column names, the number of non-null values per column, the data types, and the memory usage.

`corr()` returns a correlation matrix that describes the pairwise linear relationship between columns (Pearson's r by default). A high correlation, for example r > .8, indicates that two variables are strongly related; an r around zero indicates little or no linear relationship.

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Random variable 1  1000 non-null   float64
 1   Random variable 2  1000 non-null   float64
dtypes: float64(2)
memory usage: 15.8 KB


In [7]:
df.corr()

| | Random variable 1 | Random variable 2 |
|---|---|---|
| Random variable 1 | 1.0 | 0.049426 |
| Random variable 2 | 0.049426 | 1.0 |
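The two columns of `df` were drawn independently, so their correlation is close to zero. For contrast, the following sketch builds a column that is almost a copy of another one. The column names, the noise level of 0.1, and the seed are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
x = rng.normal(size=1000)
noise = rng.normal(size=1000)

demo = pd.DataFrame({
    "x": x,
    "x plus small noise": x + 0.1 * noise,    # nearly identical to x
    "independent": rng.normal(size=1000),     # unrelated to x
})

corr_matrix = demo.corr()  # Pearson correlation by default
print(corr_matrix.round(2))
```

The entry for `x` and `x plus small noise` should be close to 1, while the entries involving `independent` should stay near 0.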


**Exercise**: Create a data frame with at least three columns. 

**Data handling**

Typically, a dataset is stored in a file. For example, one can export an Excel sheet to a `.csv` file (CSV: comma-separated values). Similarly, we can store our data frame `df` in a `.csv` file with the `to_csv` method.

`df.to_csv("./datasets/data.csv", index=False)`

With this, a new file (`data.csv`) that contains all values from the data frame is stored in the `./datasets/` folder.
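The `index=False` argument keeps the row labels out of the file. The sketch below, using a made-up two-row frame, shows the difference; it relies on the fact that `to_csv` returns the CSV text as a string when no path is given, so nothing is written to disk.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row frame, for illustration only
example_df = pd.DataFrame({"a": [1.5, 2.5], "b": [10, 20]})

csv_without_index = example_df.to_csv(index=False)  # row labels are dropped
csv_with_index = example_df.to_csv()                # index=True is the default

# Read both versions back and compare the resulting column names
cols_without = list(pd.read_csv(StringIO(csv_without_index)).columns)
cols_with = list(pd.read_csv(StringIO(csv_with_index)).columns)

print(cols_without)  # ['a', 'b']
print(cols_with)     # ['Unnamed: 0', 'a', 'b']
```

When the index is written, it comes back as an extra column called `Unnamed: 0`, which is rarely wanted for a plain `RangeIndex`. Note also that, when a path is given, the target folder has to exist already because `to_csv` does not create it.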

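Because this course is hosted on GitHub, you do not have permission to store files in my repository. You can still read files from it, for example with a small helper function that downloads a CSV file from a URL.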

In [8]:
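import requests
from io import StringIO

def read_csv_from_github(url):
    # Download the raw file content over HTTP
    response = requests.get(url)
    # Stop with a clear error if the download failed (e.g. a 404)
    response.raise_for_status()
    # Wrap the text in a file-like object so pandas can parse it
    return pd.read_csv(StringIO(response.text))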


url = 'https://github.com/bgagl/ML_Individual_Differences/raw/5b70d36362172bb50d5be984e8c97526dda26bd2/datasets/data.csv'
data_from_github = read_csv_from_github(url=url)
data_from_github.head()

ImportError: urllib3 v2.0 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with LibreSSL 2.8.3. See: https://github.com/urllib3/urllib3/issues/2168
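The `ImportError` above is raised while importing `requests` (through its `urllib3` dependency), so it reflects the local Python installation rather than the file on GitHub. One possible workaround, sketched here under the assumption that the machine has internet access, is to pass the URL straight to `pd.read_csv`, which can open `http(s)` URLs itself. The helper name `read_csv_with_pandas` is hypothetical.

```python
from io import StringIO

import pandas as pd


def read_csv_with_pandas(source):
    """Load a CSV from a path, URL, or file-like object using pandas alone."""
    # pandas opens the source itself, so `requests` is not imported here
    return pd.read_csv(source)


# Offline check with inline CSV text (made-up values)
sample = read_csv_with_pandas(StringIO("a,b\n1,2\n"))
print(sample)

# URL copied from the cell above; uncomment to run with internet access
url = 'https://github.com/bgagl/ML_Individual_Differences/raw/5b70d36362172bb50d5be984e8c97526dda26bd2/datasets/data.csv'
# data_from_github = read_csv_with_pandas(url)
# data_from_github.head()
```

Whether this avoids the error depends on the installation, so treat it as something to try rather than a guaranteed fix.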

**Exercise**: Store your new dataframe in the file `new_random_data.csv` and load it again.