# MBTI - 16 Personalities classifier

# Data acquisition

## MBTI Dataset
The **MBTI** (Myers-Briggs Type Indicator) sorts people into 16 distinct personality types along 4 axes:
1. Introversion (I) – Extroversion (E)
2. Intuition (N) – Sensing (S)
3. Thinking (T) – Feeling (F)
4. Judging (J) – Perceiving (P)
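
Picking one letter per axis gives 2 × 2 × 2 × 2 = 16 type codes. A minimal sketch that enumerates them, assuming the letters are concatenated in the order the axes are listed above; the names `AXES` and `all_types` are illustrative and are not part of the dataset or of this notebook:

```python
from itertools import product

# One (letter, letter) pair per axis, in the order listed above.
# AXES and all_types are hypothetical names used only for illustration.
AXES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]

def all_types():
    """Return every 4-letter code obtained by picking one letter per axis."""
    return ["".join(letters) for letters in product(*AXES)]

types = all_types()
print(len(types))   # 16
print(types)
```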

The system is used in _business_, _online_, for _fun_, for _research_, and in many other contexts.

This **dataset** contains over *8600 rows of data*. Each row holds one person's:
* Type (the 4-letter MBTI code)
* An excerpt from each of their last 50 posts (entries are separated by "|||", i.e. 3 pipe characters)
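
To make the row layout concrete, here is a small sketch of how the posts field could be split back into individual posts in plain Python. The helper `split_posts` and the sample strings are made up for illustration; they are not taken from the dataset, and the notebook itself may process this field differently.

```python
SEPARATOR = "|||"  # three pipe characters, as described above

def split_posts(posts_field):
    """Split the raw posts field into a list of non-empty, trimmed posts."""
    return [p.strip() for p in posts_field.split(SEPARATOR) if p.strip()]

# Made-up example row (hypothetical values, not real data).
example_type = "INTJ"
example_posts = "First post|||Second post ||| Third post"

posts = split_posts(example_posts)
print(example_type, len(posts), posts)
```

Dropping empty entries is a design choice here: it keeps a trailing or doubled separator from producing blank posts.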

## Acknowledgements
The data was collected from the PersonalityCafe forum, which provides a large selection of people together with their MBTI personality type and what they have written.

## Downloading the dataset onto the cluster

I host the dataset on my [GitHub](https://github.com/edu-rinaldi/MBTI-Predictor/blob/main/dataset/mbti_1.csv.zip), so that I, and anyone who wants to train a different model, can download it easily.

In [0]:
%sh wget -P /tmp https://github.com/edu-rinaldi/MBTI-Predictor/raw/main/dataset/mbti_1.csv

In [0]:
dbutils.fs.mv("file:/tmp/mbti_1.csv", "/bdc-2020-21/datasets/")

In [0]:
%fs ls /bdc-2020-21/datasets/

path,name,size
dbfs:/bdc-2020-21/datasets/AllProductReviews.csv,AllProductReviews.csv,2496851
dbfs:/bdc-2020-21/datasets/mbti_1.csv,mbti_1.csv,62856486
dbfs:/bdc-2020-21/datasets/mnist-train.csv.bz2,mnist-train.csv.bz2,6732270


# Importing Spark (and some other cool things)
Python imports:

In [0]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

Now let's check that the Spark session is available:

In [0]:
spark

# Data analysis

Now that we have downloaded the dataset and imported the Spark framework, we can start analyzing it. But first, let's load the dataset into a Spark DataFrame:

In [0]:
mbti_df = spark.read.load("dbfs:/bdc-2020-21/datasets/mbti_1.csv", 
                         format="csv", 
                         sep=",", 
                         inferSchema="true", 
                         header="true"
                         )

In [0]:
mbti_df.show(5)

...TBC...