# Long format, wide format, pivot tables, and melting

This lesson is all about data transformation in pandas. Data transformation is, in essence, reorganizing the rows and columns of your dataset into a different shape and format.

The main benefit of transforming your data is easier access and manipulation, whether through simpler masking and conditional statements or because you would prefer to operate across columns or down rows.

Over time you will get a feel for which data formats are better for different tasks. This lesson, however, is focused in large part on the _functional application_ of data transformation. How do you do this to a dataset?

---

## 1. "Wide" format data

**Wide** format data is, in my opinion, the more common format of data that you will start out with when you load in datasets. You are already familiar with wide format data: I believe all of the datasets we have been using thus far have been in wide format.

Wide format data is formatted like so:

- There are multiple ID _and_ value columns. In other words, there is a column for every "variable" with its own unique values.
- The format has both the conceptual simplicity of a single column of values per variable and a more compact matrix.
- It is not well suited to SQL-style operations: it can make it much harder or even impossible to join tables together on a value.
- It can be more useful in pandas when you need to perform operations on variables **across columns**, for example multiplying columns together.
- It is most commonly the format that you will put the data in when you are ready to perform modeling (with some exceptions). When we get into modeling next week I will explain why.
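
As a rough illustration of the points above, here is a minimal sketch of wide format data and an across-column operation. The frame `wide` and its columns `q1`, `q2`, and `q1_x_q2` are made-up names for this example and are not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per subject, one column per variable (wide format).
wide = pd.DataFrame({
    'subject_id': [0, 1, 2],
    'q1': [5, 4, 1],
    'q2': [3, 2, 5],
})

# Because each variable has its own column, multiplying two variables
# together is a single element-wise expression.
wide['q1_x_q2'] = wide['q1'] * wide['q2']

print(wide)
```

The same multiplication on long format data would first require separating the `q1` rows from the `q2` rows and aligning them by subject.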

In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sns.set_style('darkgrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

---

## 2. Load the "Nerdy Personality Attributes" dataset

In [2]:
nerdy_filename = '/Users/kiefer/github-repos/DSI-SF-2/datasets/nerdy_personality_attributes/NPAS-data.csv'

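# Read the raw file and split it into rows on Windows line endings, then into fields.
# Note: under Python 3, pass newline='' to open() so the '\r\n' endings
# are preserved for the split below.
with open(nerdy_filename, 'r') as f:
    lines = [x.split(',') for x in f.read().split('\r\n')]
    print(len(lines))

# Drop malformed rows whose field count differs from the header's.
lines = [row for row in lines if len(row) == len(lines[0])]
print(len(lines))

header = lines[0]

# Build one list per column: all-digit strings become floats,
# everything else stays a string.
raw_dict = {}
for i, h in enumerate(header):
    raw_dict[h] = [float(x[i]) if x[i].isdigit() else x[i] for x in lines[1:]]

nerdy = pd.DataFrame(raw_dict)
print(nerdy.shape)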

1420
1392
(1391, 79)
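
The parsing steps in the cell above can be tried on a small in-memory string. This is only a sketch: `raw_text` and its three columns are a made-up stand-in for the real file, which is not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for the raw file contents (Windows line endings).
# The third line is deliberately short by one field.
raw_text = 'Q1,Q2,major\r\n5,3,biology\r\n4,2\r\n1,5,Studio Art\r\n'

# Split into rows, then split each row into fields.
lines = [row.split(',') for row in raw_text.split('\r\n')]
print(len(lines))

# Keep only rows with as many fields as the header; this drops the short
# row and the empty trailing row left by the final line ending.
lines = [row for row in lines if len(row) == len(lines[0])]
print(len(lines))

header, records = lines[0], lines[1:]

# One list per column, converting all-digit strings to floats.
raw_dict = {
    h: [float(row[i]) if row[i].isdigit() else row[i] for row in records]
    for i, h in enumerate(header)
}

toy = pd.DataFrame(raw_dict)
print(toy.shape)
```

One caveat with this approach: `str.isdigit()` returns `False` for strings such as `'3.5'` or `'-1'`, so decimal and negative values would stay as strings unless they are converted separately.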


In [3]:
rename_column_dict = {
        'Q1':'interested_science',
        'Q2':'in_advanced_classes',
        'Q3':'playes_rpgs',
        'Q4':'intelligence_over_appearance',
        'Q5':'collect_books',
        'Q6':'academic_over_social',
        'Q7':'watch_science_shows',
        'Q8':'like_dry_topics',
        'Q9':'like_science_fiction',
        'Q10':'books_over_parties',
        'Q11':'hobbies_over_people',
        'Q12':'libraries_over_publicspace',
        'Q13':'bookish',
        'Q14':'read_tech_reports',
        'Q15':'writing_novel',
        'Q16':'introspective',
        'Q17':'online_over_inperson',
        'Q18':'like_hard_material',
        'Q19':'play_many_videogames',
        'Q20':'was_odd_child',
        'Q21':'prefer_fictional_people',
        'Q22':'enjoy_learning',
        'Q23':'excited_about_research',
        'Q24':'strange_person',
        'Q25':'like_superheroes',
        'Q26':'socially_awkward',
        'TIPI1':'extraverted',
        'TIPI2':'critical',
        'TIPI3':'dependable',
        'TIPI4':'anxious',
        'TIPI5':'opennness',
        'TIPI6':'reserved',
        'TIPI7':'sympathetic',
        'TIPI8':'disorganized',
        'TIPI9':'calm',
        'TIPI10':'conventional',
        'race+AF8-arab':'race_arab',
        'race+AF8-asian':'race_asian',
        'race+AF8-black':'race_black',
        'race+AF8-white':'race_white',
        'race+AF8-hispanic':'race_hispanic',
        'race+AF8-nativeam':'race_native_american',
        'race+AF8-nativeau':'race_native_austrailian',
        'race+AF8-other':'race_nerdy',
        'ASD':'diagnosed_autistic'
    }

nerdy.rename(columns=rename_column_dict, inplace=True)

In [5]:
demo_cols = ['education','urban','gender','engnat','age','hand','religion',
            'voted','married','familysize','major']

# list() keeps this working on Python 3, where dict.values() returns a view.
column_selection = list(rename_column_dict.values()) + demo_cols
nerdy = nerdy[column_selection]
nerdy['subject_id'] = range(nerdy.shape[0])

In [7]:
nerdy.to_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/nerdy_personality_attributes/NPAS_parsed_trunc.csv',
             index=False)
nerdy[demo_cols+['subject_id']].to_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/nerdy_personality_attributes/NPAS_parsed_trunc_demos.csv',
                                       index=False)
nerdy[[x for x in nerdy.columns if x not in demo_cols]].to_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/nerdy_personality_attributes/NPAS_parsed_trunc_survey.csv',
                                                               index=False)

In [8]:
nerdy_demo = nerdy[demo_cols+['subject_id']]
nerdy_survey = nerdy[[x for x in nerdy.columns if x not in demo_cols]]
print(nerdy_demo.shape)
print(nerdy_survey.shape)

(1391, 12)
(1391, 46)


In [9]:
nerdy_demo.head()

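| | education | urban | gender | engnat | age | hand | religion | voted | married | familysize | major | subject_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 2.0 | 2.0 | 1.0 | 69.0 | 1.0 | 6.0 | 1.0 | 3.0 | 4.0 | Studio Art | 0 |
| 1 | 4.0 | 2.0 | 2.0 | 1.0 | 50.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | biophysics | 1 |
| 2 | 3.0 | 1.0 | 2.0 | 2.0 | 22.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | biology | 2 |
| 3 | 4.0 | 3.0 | 1.0 | 1.0 | 44.0 | 1.0 | 2.0 | 2.0 | 3.0 | 4.0 | Mathematics | 3 |
| 4 | 1.0 | 1.0 | 2.0 | 1.0 | 17.0 | 1.0 | 7.0 | 2.0 | 1.0 | 1.0 | | 4 |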


In [10]:
nerdy_survey.head()

| | race_white | race_nerdy | race_native_american | writing_novel | read_tech_reports | online_over_inperson | introspective | hobbies_over_people | books_over_parties | bookish | ... | reserved | conventional | was_odd_child | prefer_fictional_people | enjoy_learning | excited_about_research | strange_person | like_superheroes | socially_awkward | subject_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 3.0 | 5.0 | 4.0 | 5.0 | 4.0 | 5.0 | 5.0 | ... | 7.0 | 1.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 0 |
| 1 | 1.0 | 0.0 | 0.0 | 1.0 | 4.0 | 3.0 | 3.0 | 1.0 | 4.0 | 4.0 | ... | 5.0 | 1.0 | 3.0 | 3.0 | 3.0 | 4.0 | 4.0 | 4.0 | 5.0 | 1 |
| 2 | 1.0 | 0.0 | 0.0 | 4.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 7.0 | 1.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 3.0 | 5.0 | 2 |
| 3 | 1.0 | 0.0 | 0.0 | 4.0 | 4.0 | 5.0 | 2.0 | 5.0 | 5.0 | 4.0 | ... | 2.0 | 1.0 | 5.0 | 4.0 | 1.0 | 5.0 | 5.0 | 5.0 | 5.0 | 3 |
| 4 | 1.0 | 0.0 | 0.0 | 1.0 | 5.0 | 5.0 | 1.0 | 4.0 | 5.0 | 5.0 | ... | 6.0 | 2.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 0.0 | 4 |


In [11]:
subject_sample = np.random.choice(nerdy_demo.subject_id, size=700, replace=False)
nerdy_demo_samp = nerdy_demo.loc[nerdy_demo.subject_id.isin(subject_sample), :]
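
The sampling cell above draws subject IDs without replacement and then keeps only the matching rows. A minimal sketch of the same pattern on a toy frame follows; the frame `demo`, its `age` values, and the fixed seed are assumptions made for this example (the original cell does not set a seed).

```python
import numpy as np
import pandas as pd

# Hypothetical toy demographics table with ten subjects.
demo = pd.DataFrame({
    'subject_id': range(10),
    'age': [69, 50, 22, 44, 17, 30, 25, 41, 19, 33],
})

# A seeded generator is an assumption here, used only so the example
# gives the same sample on every run.
rng = np.random.RandomState(42)
sampled_ids = rng.choice(demo['subject_id'], size=4, replace=False)

# isin() builds a boolean mask, so the kept rows stay in their original order.
demo_samp = demo.loc[demo['subject_id'].isin(sampled_ids), :]

print(demo_samp.shape)
```

Because `replace=False` guarantees unique IDs, the sampled frame has exactly `size` rows as long as `subject_id` is unique in the source frame.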

In [12]:
nerdy_demo_samp.to_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/nerdy_personality_attributes/NPAS_parsed_trunc_demo_sample.csv',
                       index=False)

In [13]:
nerdy_demo = nerdy_demo_samp
nerdy_survey = nerdy[[x for x in nerdy.columns if x not in demo_cols]]
print(nerdy_demo.shape)
print(nerdy_survey.shape)

(700, 12)
(1391, 46)


In [15]:
nerdy_demo_long = pd.melt(nerdy_demo, id_vars=['subject_id'])

In [16]:
nerdy_demo_long.head()

| | subject_id | variable | value |
|---|---|---|---|
| 0 | 1 | education | 4 |
| 1 | 2 | education | 3 |
| 2 | 5 | education | 2 |
| 3 | 6 | education | 2 |
| 4 | 7 | education | 2 |
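
To see what `pd.melt` does on a scale small enough to read in full, here is a minimal sketch. The frame `wide` is a toy built for this example, using the `education` and `age` values of the first three subjects shown earlier.

```python
import pandas as pd

# Toy wide frame: one row per subject, one column per variable.
wide = pd.DataFrame({
    'subject_id': [0, 1, 2],
    'education': [3, 4, 3],
    'age': [69, 50, 22],
})

# id_vars stay as identifier columns; every other column is stacked into
# a pair of columns named 'variable' and 'value' by default.
long = pd.melt(wide, id_vars=['subject_id'])

print(long)
```

The result has one row per subject per melted column, so three subjects and two value columns give six rows. All rows for `education` come first, followed by all rows for `age`.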


In [17]:
nerdy_survey_long = pd.melt(nerdy_survey, id_vars=['subject_id'])

In [21]:
nerdy_long = pd.concat([nerdy_demo_long, nerdy_survey_long])
nerdy_long.to_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/nerdy_personality_attributes/NPAS_parsed_trunc_long_missing.csv',
                  index=False)

In [19]:
nerdy_long.shape

(70295, 3)

In [20]:
nerdy_demo_long.shape

(7700, 3)
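
The shapes above follow from simple arithmetic: the melted demographics have 700 subjects × 11 variables = 7700 rows, the melted survey has 1391 subjects × 45 variables = 62595 rows, and concatenating them gives 70295 rows. The sketch below reproduces that pattern on toy frames and then pivots back to wide to show where the gaps appear. The frames and column choices are assumptions for this example, and the pivot step is an illustration rather than something the cells above perform.

```python
import pandas as pd

# Hypothetical toy frames: demographics exist for only two of three subjects,
# mirroring the sampled demographics table.
demo = pd.DataFrame({'subject_id': [0, 2], 'age': [69, 22], 'hand': [1, 1]})
survey = pd.DataFrame({'subject_id': [0, 1, 2], 'bookish': [5, 4, 5]})

demo_long = pd.melt(demo, id_vars=['subject_id'])      # 2 subjects * 2 variables
survey_long = pd.melt(survey, id_vars=['subject_id'])  # 3 subjects * 1 variable

# Stacking the two long frames simply adds their row counts.
combined = pd.concat([demo_long, survey_long], ignore_index=True)
print(combined.shape)  # (7, 3)

# Pivoting back to wide shows the missing demographics as NaN for subject 1.
wide_again = combined.pivot(index='subject_id', columns='variable', values='value')
print(wide_again)
```

In long format the missing demographics are simply absent rows. They only become visible as `NaN` once the data is spread back out to one column per variable.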
