
# Visualizing Data
Data visualization is a powerful tool that helps us make sense of complex information. In this chapter, you'll work with the BFI (“Big Five Inventory”) dataset, which looks at five major personality traits: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. This dataset offers a hands-on way to explore real-world data.

You'll learn how to create different types of plots using the Matplotlib library. From simple scatter plots to more complex histograms, you'll see how to present data clearly and effectively. You'll also discover how to customize your plots with labels and colors to make them more informative.

Understanding what the data means is crucial, and you'll find out how "data dictionaries" help with this. These guides explain the meaning and structure of the data and will be an essential part of your work with the BFI dataset.

But there's more to this chapter than just techniques. You'll also think about some big questions. How do visualizations represent reality? Are they true pictures of what's there, or just useful tools? This is part of a philosophical debate called the realism-instrumentalism debate, and you'll explore it using the BFI dataset as an example.

This chapter offers a practical and thoughtful look at data visualization. You'll gain skills that are useful in many fields, and you'll also think about what those skills mean. Whether you're interested in psychology, data science, or just curious about how to represent information, this chapter has something to offer you.

## Background to the Big Five
The **Big Five Model of Personality**, also known as the Five-Factor Model (FFM), identifies five broad dimensions that describe human personality at the highest level of organization. These dimensions are Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism.

**Openness to Experience** refers to an individual's willingness to engage with new ideas, experiences, emotions, and actions. People high in openness might be found trying new foods, reading various types of literature, or engaging in artistic pursuits, and they are often seen as imaginative, curious, and creative.

**Conscientiousness** is characterized by being organized, responsible, and hard-working. Someone with high conscientiousness might maintain a detailed schedule, keep their living space neat, and be punctual, showing a goal-oriented and reliable nature.

**Extraversion** refers to the degree to which a person is outgoing, sociable, and enjoys engaging with others. Those who score high on extraversion are often seen at social gatherings, actively participating in community events, and making new friends with energy and enthusiasm.

**Agreeableness** relates to how cooperative, warm, and considerate a person is in interactions with others. Someone with high agreeableness is likely to be compassionate, understanding, and willing to help others, often volunteering or assisting friends and family.

**Neuroticism** refers to the tendency to experience negative emotions such as anxiety, depression, and anger. A person with high neuroticism may often feel stressed, worry about small things, or be easily irritated, struggling to cope with daily challenges and experiencing mood swings.

The importance of the Big Five Model of Personality is manifold. It assists in understanding why people behave differently in similar situations, offers guidance in career and relationships by aligning with individual traits, and aids psychologists in the diagnosis and treatment of mental health issues. By categorizing personality into these five broad dimensions, the Big Five Model provides a clear and simple way to describe complex human behaviors. Whether used in personal development, professional guidance, or therapeutic settings, this model has become an essential tool in psychology and continues to play a significant role in the understanding of human personality.

## Loading the Big Five Data Set
Now, let's load the Big Five data set and take a look at it. First, we'll display the head:

In [1]:
!pip install pydataset -q # Install required packages
from pydataset import data # Import required modules
import pandas as pd

bfi_df = data('bfi') # Load the Big Five Inventory dataset
bfi_df.head() # display first five rows

initiated datasets repo at: /root/.pydataset/


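| | A1 | A2 | A3 | A4 | A5 | C1 | C2 | C3 | C4 | C5 | ... | N4 | N5 | O1 | O2 | O3 | O4 | O5 | gender | education | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61617 | 2.0 | 4.0 | 3.0 | 4.0 | 4.0 | 2.0 | 3.0 | 3.0 | 4.0 | 4.0 | ... | 2.0 | 3.0 | 3.0 | 6 | 3.0 | 4.0 | 3.0 | 1 | NaN | 16 |
| 61618 | 2.0 | 4.0 | 5.0 | 2.0 | 5.0 | 5.0 | 4.0 | 4.0 | 3.0 | 4.0 | ... | 5.0 | 5.0 | 4.0 | 2 | 4.0 | 3.0 | 3.0 | 2 | NaN | 18 |
| 61620 | 5.0 | 4.0 | 5.0 | 4.0 | 4.0 | 4.0 | 5.0 | 4.0 | 2.0 | 5.0 | ... | 2.0 | 3.0 | 4.0 | 2 | 5.0 | 5.0 | 2.0 | 2 | NaN | 17 |
| 61621 | 4.0 | 4.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | 3.0 | 5.0 | 5.0 | ... | 4.0 | 1.0 | 3.0 | 3 | 4.0 | 3.0 | 5.0 | 2 | NaN | 17 |
| 61622 | 2.0 | 3.0 | 3.0 | 4.0 | 5.0 | 4.0 | 4.0 | 5.0 | 3.0 | 2.0 | ... | 4.0 | 3.0 | 3.0 | 3 | 4.0 | 3.0 | 3.0 | 1 | NaN | 17 |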




### Digging Deeper With Data Dictionaries
Now that we've loaded our data, we still need to figure out what all of the columns "mean." After all, understanding the variables and attributes within a dataset is essential for proper analysis and interpretation. To get there, we need a guide that explains the meaning and characteristics of each attribute. This is where the concept of a data dictionary becomes pivotal.

A **data dictionary** is a collection of descriptions, definitions, and information about the data in a database or dataset. It serves as a compass for researchers, analysts, and users, guiding them to understand what each attribute means, its significance, and how it should be interpreted. By including details such as the attribute name, description, data type, constraints, and relationships, a data dictionary provides a roadmap to navigate the complex landscape of a dataset.

The importance of a data dictionary extends beyond mere understanding. It plays a crucial role in ensuring data quality and consistency by defining constraints and valid values. This standardized approach reduces errors and inconsistencies, facilitating smooth data collection and entry. Moreover, in a collaborative environment, a data dictionary acts as a common language, bridging gaps in communication among team members. It also serves as a vital tool for compliance and documentation, especially in regulated industries.
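To make the idea concrete, here is a minimal sketch of a data dictionary written as a plain Python dictionary, together with a helper that checks one record against it. The column names, descriptions, value ranges, and the `validate_record` helper are all illustrative assumptions; they are not part of the BFI dataset or of any library.

```python
# A tiny, hypothetical data dictionary: one entry per column.
# Descriptions and valid ranges are illustrative assumptions.
data_dictionary = {
    "age": {"description": "Age in years", "type": int, "min": 0, "max": 120},
    "score": {"description": "Response on a 1-6 scale", "type": int, "min": 1, "max": 6},
}

def validate_record(record, dictionary):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for column, rules in dictionary.items():
        if column not in record:
            problems.append(f"{column}: missing")
            continue
        value = record[column]
        if not isinstance(value, rules["type"]):
            problems.append(f"{column}: expected {rules['type'].__name__}")
        elif not rules["min"] <= value <= rules["max"]:
            problems.append(f"{column}: {value} outside {rules['min']}-{rules['max']}")
    return problems

good = validate_record({"age": 25, "score": 4}, data_dictionary)
bad = validate_record({"age": 25, "score": 9}, data_dictionary)
print(good)  # no problems found
print(bad)   # the out-of-range score is reported
```

Because the constraints live in one place, the same dictionary can serve both as documentation for readers and as a check during data entry.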

As it turns out, the Big Five data set also includes a data dictionary, which tells us about the "meaning" of these different items.

In [15]:
# Show the data dictionary
data('bfi.dictionary')[["ItemLabel","Item","Keying"]]

| | ItemLabel | Item | Keying |
|---|---|---|---|
| A1 | q_146 | Am indifferent to the feelings of others. | -1.0 |
| A2 | q_1162 | Inquire about others' well-being. | 1.0 |
| A3 | q_1206 | Know how to comfort others. | 1.0 |
| A4 | q_1364 | Love children. | 1.0 |
| A5 | q_1419 | Make people feel at ease. | 1.0 |
| C1 | q_124 | Am exacting in my work. | 1.0 |
| C2 | q_530 | Continue until everything is perfect. | 1.0 |
| C3 | q_619 | Do things according to a plan. | 1.0 |
| C4 | q_626 | Do things in a half-way manner. | -1.0 |
| C5 | q_1949 | Waste my time. | -1.0 |


As mentioned above, the BFI measures five major personality traits, often referred to as the "Big Five." Each trait is identified in the dataset by a one-letter prefix:

1.  Agreeableness (A): Describes a person's tendency to be compassionate and cooperative.
2.  Conscientiousness (C): Represents a person's self-discipline and aim for achievement.
3.  Extraversion (E): Indicates how outgoing and energetic a person is.
4.  Neuroticism (N): Reflects a person's tendency toward emotional instability and negative moods.
5.  Openness (O): Shows a person's willingness to experience new things.

The dataset contains statements related to these traits, and respondents rate their agreement or disagreement with each one.

#### Data Dictionary Structure

The data dictionary view shown above has a row label (the first, unnamed column) followed by three columns: `ItemLabel`, `Item`, and `Keying`. Here's what each one represents:

1.  Row label: This is a shorthand code for the question, making it easier to reference specific items. For example, "A1" represents the first question related to Agreeableness. These codes match the column names in `bfi_df`.
2.  ItemLabel: This represents the internal question code, like "q_146." It is used for tracking and managing the questionnaire.
3.  Item: This is the text of the statement that respondents rate, such as "Am indifferent to the feelings of others."
4.  Keying: This column defines the scoring direction. A "1.0" means agreement with the statement increases the trait score, while a "-1.0" means agreement decreases the score. For example, agreeing with "Am indifferent to the feelings of others" decreases the Agreeableness score.

In addition to the trait questions, there are demographic variables like `gender`, `education`, and `age`.

#### Examples

1.  A1 (q_146, -1.0): If a respondent agrees with being indifferent to the feelings of others, the Agreeableness score decreases.
2.  C2 (q_530, 1.0): Agreeing with continuing until everything is perfect increases the Conscientiousness score.
3.  gender (males=1, females=2): Encodes the gender information, with males represented by 1 and females by 2.
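
The keying rule can be sketched in code. On a 1-to-6 scale, one common way to handle a negatively keyed item is to reverse it (7 minus the response) before combining it with the other items. The small DataFrame below is made-up data, and `reverse_keyed` and the `keying` mapping are hypothetical names used only for illustration; this is a sketch of the general idea, not necessarily the scoring procedure used by the dataset's authors.

```python
import pandas as pd

# Made-up responses for two Agreeableness items (1-6 scale), for illustration only.
responses = pd.DataFrame({"A1": [1, 6, 3], "A2": [6, 2, 4]})

# Keying values copied from the data dictionary: A1 is negatively keyed.
keying = {"A1": -1, "A2": 1}

def reverse_keyed(df, keying, scale_min=1, scale_max=6):
    """Return a copy in which negatively keyed items are flipped (1<->6, 2<->5, 3<->4)."""
    out = df.copy()
    for column, key in keying.items():
        if key < 0:
            out[column] = (scale_min + scale_max) - out[column]
    return out

scored = reverse_keyed(responses, keying)

# After reversal, a higher number means "more agreeable" on every item,
# so averaging across items gives a simple trait score per respondent.
agreeableness = scored.mean(axis=1)
print(scored)
print(agreeableness)
```

The first made-up respondent strongly disagrees with A1 and strongly agrees with A2, so after reversal both items point the same way and the average lands at the top of the scale.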

OK, so now we know what the data labels mean. Let's start exploring our data!

## Getting to Know the Data

Let's explore our data using some of the methods we learned in previous chapters. First, let's get the big-picture view with `.shape` and `.describe()`.

In [18]:
bfi_df.shape

(2800, 28)

It appears we have 2800 rows, each representing a different person who has been given our personality test. We also have 28 columns: 25 correspond to their answers to the questionnaire items, and 3 hold basic biographical information (`gender`, `education`, and `age`).

In [16]:
bfi_df.describe()

| | A1 | A2 | A3 | A4 | A5 | C1 | C2 | C3 | C4 | C5 | ... | N4 | N5 | O1 | O2 | O3 | O4 | O5 | gender | education | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2784.0 | 2773.0 | 2774.0 | 2781.0 | 2784.0 | 2779.0 | 2776.0 | 2780.0 | 2774.0 | 2784.0 | ... | 2764.0 | 2771.0 | 2778.0 | 2800.0 | 2772.0 | 2786.0 | 2780.0 | 2800.0 | 2577.0 | 2800.0 |
| mean | 2.413434 | 4.80238 | 4.603821 | 4.699748 | 4.560345 | 4.502339 | 4.369957 | 4.303957 | 2.553353 | 3.296695 | ... | 3.185601 | 2.969686 | 4.816055 | 2.713214 | 4.438312 | 4.892319 | 2.489568 | 1.671786 | 3.190144 | 28.782143 |
| std | 1.407737 | 1.17202 | 1.301834 | 1.479633 | 1.258512 | 1.241347 | 1.318347 | 1.288552 | 1.375118 | 1.628542 | ... | 1.569685 | 1.618647 | 1.12953 | 1.565152 | 1.220901 | 1.22125 | 1.327959 | 0.469647 | 1.107714 | 11.127555 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 |
| 25% | 1.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 1.0 | 2.0 | ... | 2.0 | 2.0 | 4.0 | 1.0 | 4.0 | 4.0 | 1.0 | 1.0 | 3.0 | 20.0 |
| 50% | 2.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 2.0 | 3.0 | ... | 3.0 | 3.0 | 5.0 | 2.0 | 5.0 | 5.0 | 2.0 | 2.0 | 3.0 | 26.0 |
| 75% | 3.0 | 6.0 | 6.0 | 6.0 | 5.0 | 5.0 | 5.0 | 5.0 | 4.0 | 5.0 | ... | 4.0 | 4.0 | 6.0 | 4.0 | 5.0 | 6.0 | 3.0 | 2.0 | 4.0 | 35.0 |
| max | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | ... | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 6.0 | 2.0 | 5.0 | 86.0 |


From this, we can see a few things:
1. It appears as though each personality item is "scored" between 1 (min) and 6 (max), which presumably corresponds to a range from "Strongly Disagree" to "Strongly Agree."
2. From the data dictionary above, we know that some of these items are "negative," while others are "positive." So, for example, agreeing strongly with A1 (about being indifferent to the feelings of others) will DETRACT from agreeableness, while agreeing with A2 will ADD to agreeableness.
3. We can also see summary statistics such as the mean, the standard deviation, and the median (the 50% row).
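
The `count` row is worth a second look: it is below 2800 for most columns, which suggests that some respondents skipped some items. Here is a minimal sketch of how to count missing values per column. It uses a small made-up DataFrame (`toy_df`, not the BFI data) so the result is easy to check by hand; in the notebook, the same calls could be applied to `bfi_df`.

```python
import numpy as np
import pandas as pd

# Made-up data with a few missing responses (NaN), for illustration only.
toy_df = pd.DataFrame({
    "A1": [2, np.nan, 5, 4],
    "A2": [4, 4, np.nan, np.nan],
    "age": [16, 18, 17, 17],
})

# .describe() reports the number of non-missing values in its "count" row...
non_missing = toy_df.describe().loc["count"]

# ...so the number of missing values is the number of rows minus that count.
missing = len(toy_df) - non_missing

# .isna().sum() gives the same answer directly.
missing_direct = toy_df.isna().sum()
print(missing_direct)
```

Knowing which columns have gaps matters for plotting, since missing values are typically left out of histograms and averages.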

Now, let's see what we can learn about our data set using visualizations!


## Data Cleaning
For this chapter, we're only going to focus on the "Big Five" measure of personality. With this in mind, we can eliminate the columns we don't need. We'll also clean up some other columns to make them more readable.

In [None]:
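# Drop the demographic columns so that only the 25 personality items remain
big5_df = bfi_df.drop(columns=["gender", "education", "age"])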