# 1. Setup
We start by importing the necessary Python libraries. We also load the `essays` dataset (`essays.csv`), which lives in the `data` directory of our project.

In [20]:
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

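essays_raw = pd.read_csv('data/essays.csv', engine='python')
# Strip the leading 'c' / '#' prefix from each column name (e.g. cNEU -> neu) and lowercase it
essays_raw.columns = essays_raw.columns.str.replace(r"^[c#]", "", regex=True).str.lower()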

# 2. Inspect the data

Let's take a quick look at the structure of the `essays` dataset.

In [30]:
# Print the first few rows of the dataset
print(essays_raw.head())

# Print the number of rows and columns
print("\nNumber of rows and columns: {} \n".format(essays_raw.shape))

# Check if author ids are unique
if essays_raw['authid'].is_unique:
    print('All author ids are unique!')
else:
    print('Author ids are not unique!')
    
# Check if there are missing values in the dataset:
essays_raw.isnull().sum()

            authid                                               text ext neu  \
0  1997_504851.txt  Well, right now I just woke up from a mid-day ...   n   y   
1  1997_605191.txt  Well, here we go with the stream of consciousn...   n   n   
2  1997_687252.txt  An open keyboard and buttons to push. The thin...   n   y   
3  1997_568848.txt  I can't believe it!  It's really happening!  M...   y   n   
4  1997_688160.txt  Well, here I go with the good old stream of co...   y   n   

  agr con opn  
0   y   n   y  
1   y   n   n  
2   n   y   y  
3   y   y   n  
4   y   n   y  

Number of rows and columns: (2467, 7) 

All author ids are unique!


authid    0
text      0
ext       0
neu       0
agr       0
con       0
opn       0
dtype: int64

We can see that the dataset contains 2467 essays, each written by a different author, and that it has no missing values. Each essay is associated with an author id and five binary labels (one label per personality dimension):

* Extraversion (`ext`)
* Neuroticism (`neu`)
* Agreeableness (`agr`)
* Conscientiousness (`con`)
* Openness (`opn`)

Note that in psychological theory, the Big Five model treats all five traits as independent continuous dimensions (and even defines subdimensions, so-called facets, for each of them). For this machine learning task, however, the labels in our dataset are just binary categories (e.g. a `y` in the `neu` column, originally named `cNEU`, indicates that the author of the given essay is classified as neurotic).
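Most learning algorithms expect numeric targets, so the `y`/`n` labels will eventually have to be encoded as 1/0. The following is only a minimal sketch of one way to do that; the `toy` frame and its values are made up for illustration and merely mimic the layout of the label columns.

```python
import pandas as pd

# Hypothetical toy rows that mimic the y/n label columns of the essays dataset
toy = pd.DataFrame({
    "authid": ["a.txt", "b.txt", "c.txt"],
    "ext": ["n", "y", "y"],
    "neu": ["y", "n", "n"],
})

label_cols = ["ext", "neu"]

# Map 'y' -> 1 and 'n' -> 0; any unexpected value would become NaN
encoded = toy.copy()
for col in label_cols:
    encoded[col] = toy[col].map({"y": 1, "n": 0})

print(encoded)
```

Using an explicit mapping rather than a comparison such as `== "y"` has the side effect that unexpected label values surface as missing values instead of silently turning into zeros.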

Our goal is to predict the five binary labels for a given essay. In other words, the task at hand is a multi-label binary classification task. Before we create a train-test split and preprocess the data, let's explore it a little further.
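To make the term "multi-label" concrete: the target is not a single column but a matrix with one row per essay and one 0/1 column per trait, and any number of traits can be positive for the same essay. The sketch below illustrates this with a small hand-written matrix; the values in `Y` are toy data, not results computed from the dataset.

```python
import numpy as np

# Hypothetical 0/1 label matrix: one row per essay, one column per trait
traits = ["ext", "neu", "agr", "con", "opn"]
Y = np.array([
    [0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 1, 1, 0],
])

# Multi-label: a row may contain any number of positive labels (0 to 5)
positives_per_essay = Y.sum(axis=1)

# Share of positive examples per trait, useful for spotting class imbalance
positive_rate = dict(zip(traits, Y.mean(axis=0).round(2).tolist()))

print(positives_per_essay)
print(positive_rate)
```

A simple baseline for such a target is to train one independent binary classifier per column, although this ignores any correlation between the traits.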