# Dataframes and data handling

Dataframes are tabular data structures with labeled axes (rows and columns, as in an Excel sheet). They are widely used in data science, for example through the pandas library in Python or the built-in data frames in R (which may be familiar to R users).

Here we will use [pandas](https://pandas.pydata.org/docs/). A cheat sheet is available [here](http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3).

First, we will import three libraries: `pandas`, `random`, and `numpy`.

In [1]:
import pandas as pd
import random
import numpy as np

First, let's define a simple example data frame with 1000 rows and two columns of random values.

In [2]:
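num_rows = 1000
# List comprehension: draw num_rows random integers between 100 and 1000 (both inclusive)
random_data_integer = [random.randint(100, 1000) for _ in range(num_rows)]
# Draw num_rows floats from a standard normal distribution (mean 0, standard deviation 1)
random_data_float = np.random.normal(loc=0, scale=1, size=num_rows)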

In [3]:
len(random_data_float)

1000

Then we create a **dictionary** (a data structure that maps keys to values of any type) that holds the two data sequences under descriptive names.

In [4]:
data = {'Random Integer': random_data_integer, 'Random Float': random_data_float}
data

{'Random Integer': [682,
  678,
  721,
  379,
  760,
  ...

The dictionary now holds two sequences of numbers: a list of integers and a NumPy array of floats.
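
As a minimal sketch of how such a dictionary behaves, the following example uses a small hypothetical dictionary (the names `small_data`, `'A'`, and `'B'` are illustrative and not part of the course data). Each key maps to one sequence of values, and the keys later become the column names of the data frame.

```python
import numpy as np

# Hypothetical toy data: one list and one NumPy array of equal length
small_data = {'A': [1, 2, 3], 'B': np.array([0.5, 1.5, 2.5])}

# Values are looked up by their key with square brackets
print(small_data['A'])  # [1, 2, 3]

# All keys of the dictionary, in insertion order
print(list(small_data.keys()))  # ['A', 'B']

# Both sequences have the same length, which a data frame requires
print(len(small_data['A']), len(small_data['B']))  # 3 3
```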

With the `DataFrame` constructor from pandas (imported here under the alias `pd`), we can convert the dictionary into a data frame.

In [5]:
df = pd.DataFrame(data)
df

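|     | Random Integer | Random Float |
|-----|----------------|--------------|
| 0   | 682            | 0.623147     |
| 1   | 678            | 0.863319     |
| 2   | 721            | -0.588700    |
| 3   | 379            | -0.353956    |
| 4   | 760            | 0.682922     |
| ... | ...            | ...          |
| 995 | 577            | -0.314786    |
| 996 | 937            | -0.518871    |
| 997 | 999            | 2.024506     |
| 998 | 683            | 2.544311     |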


In the output here, we can now see the tabular structure of the data frame. 
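
To make the tabular structure more concrete, here is a minimal sketch with a small hypothetical data frame (`toy_df` and its column names are illustrative assumptions, not course data). It shows how to check the dimensions, list the column names, and select a single column.

```python
import pandas as pd

# Hypothetical toy data frame with three rows and two columns
toy_df = pd.DataFrame({'Score': [10, 20, 30], 'Weight': [0.1, 0.2, 0.3]})

print(toy_df.shape)            # (number of rows, number of columns)
print(list(toy_df.columns))    # column names taken from the dictionary keys
print(toy_df['Score'].mean())  # select one column and compute its mean
print(toy_df.head(2))          # show only the first two rows
```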

In addition, pandas provides many methods that can be applied directly to a data frame:

`info()` prints a concise summary of the data frame: the index range, the column names, the number of non-null values per column, the data types, and the memory usage.

`corr()` returns a correlation matrix with the pairwise correlations between columns (Pearson's r by default), which indicates how similar the variables are. A strong correlation (e.g., r > .8) indicates that two variables are highly similar; an r around zero indicates low similarity, that is, no linear relationship.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Random Integer  1000 non-null   int64  
 1   Random Float    1000 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 15.8 KB


In [7]:
df.corr()

|                | Random Integer | Random Float |
|----------------|----------------|--------------|
| Random Integer | 1.0            | 0.013009     |
| Random Float   | 0.013009       | 1.0          |
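
Because both columns above are random and independent, their correlation is close to zero. As a contrast, the following sketch builds a hypothetical data frame (`x`, `y`, `z`, the noise level, and the seed are assumptions chosen for illustration) in which `y` is constructed from `x` plus a little noise, so the two should correlate strongly, while `z` is independent of both.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

x = rng.normal(loc=0, scale=1, size=1000)
y = 2 * x + rng.normal(loc=0, scale=0.5, size=1000)  # depends on x plus noise
z = rng.normal(loc=0, scale=1, size=1000)            # independent of x and y

corr_df = pd.DataFrame({'x': x, 'y': y, 'z': z})
corr_matrix = corr_df.corr()  # Pearson correlation by default
print(corr_matrix.round(2))
```

The diagonal is always 1 because each variable correlates perfectly with itself, and the matrix is symmetric, so each pair appears twice.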


**Exercise**: Create a data frame with at least three columns. 

**Data handling**

Typically, a dataset is stored in a file. For example, one can export an Excel sheet to a `.csv` file (CSV: comma-separated values). Similarly, we can store our data frame `df` in a `.csv` file with the `to_csv` method.

`df.to_csv("./datasets/data.csv", index=False)`

This stores a new file (`data.csv`) in the `./datasets/` folder that contains all values from the data frame. The argument `index=False` omits the row index from the file.
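
The role of `index=False` can be illustrated with a short round trip. This is a sketch under assumptions: it writes to a temporary folder (via Python's `tempfile` module) rather than `./datasets/`, and `demo_df` and the file names are hypothetical. Without `index=False`, the row index is written as an extra unnamed column, which pandas labels `Unnamed: 0` when the file is read back.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data frame
demo_df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 1.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, 'with_index.csv')
    without_index = os.path.join(tmp_dir, 'without_index.csv')

    demo_df.to_csv(with_index)                  # row index is written as the first column
    demo_df.to_csv(without_index, index=False)  # only the data columns are written

    # Read both files back while the temporary folder still exists
    columns_with_index = list(pd.read_csv(with_index).columns)
    columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)     # ['Unnamed: 0', 'A', 'B']
print(columns_without_index)  # ['A', 'B']
```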

Because this course is hosted on GitHub, you do not have permission to write files to my repository. You can, however, read the stored file directly from GitHub:

In [None]:
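def read_csv_from_github(url):
    """Download a CSV file from a raw GitHub URL and return it as a data frame."""
    import requests
    from io import StringIO
    response = requests.get(url)
    response.raise_for_status()  # stop early if the download failed (e.g., 404)
    data = response.text  # file content as one string
    return pd.read_csv(StringIO(data))  # StringIO lets pandas read the string like a file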


url = 'https://github.com/bgagl/ML_Individual_Differences/raw/5b70d36362172bb50d5be984e8c97526dda26bd2/datasets/data.csv'
data_from_github = read_csv_from_github(url=url)
data_from_github.head()
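
The helper above passes the downloaded text through `StringIO` so that `pd.read_csv` can treat it like a file. The following offline sketch shows that step in isolation; the CSV text is a hypothetical stand-in for `response.text`, not the actual content of the course file.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text standing in for the downloaded response
csv_text = "Random Integer,Random Float\n100,0.5\n200,-1.25\n"

# StringIO wraps the string in a file-like object that read_csv accepts
parsed = pd.read_csv(StringIO(csv_text))
print(parsed.shape)
print(parsed.dtypes)
```

Note that `pd.read_csv` can usually also read from a URL directly (e.g., `pd.read_csv(url)`), which may make a separate helper unnecessary in simple cases.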

**Exercise**: Store your new dataframe in the file `new_random_data.csv` and load it again.