# Data Restructure
> The practice and procedure of achieving Tidy Data
- toc: true
- branch: master
- badges: true
- comments: false
- categories: [fastpages, jupyter]



![](my_icons/clean.jpeg)

## Introduction

A couple of weeks ago, I drew your attention to the bitter truth about Data Science. In that piece, I posited that beyond the 'sexiness' and other euphemisms wrapped around the practice, Data Science is data wrangling and cleaning! How else can one describe a profession where over 90 per cent of the activity revolves around data cleaning and wrangling?

I'm happy my mentors drilled this view of Data Science into my subconscious at the very beginning of my encounter with the field. The view that data cleaning and wrangling form the larger part of Data Science practice has not only humbled me but has also helped me develop a growth mindset in my quest to learn and practice Data Science. After all, most data in the wild needs a significant amount of restructuring before any detailed analysis can begin.

My attention this week turns to the concept and practice of data restructuring, or 'tidy data'. Tidy data is one topic that, in my opinion, is critical to building a solid foundation in Data Science. Data restructuring as a procedure has grown in leaps and bounds, so much so that in data science and analysis parlance it is now popularly referred to as **Tidy Data**, thanks to [Hadley Wickham](http://vita.had.co.nz/papers/tidy-data.pdf).



## What is Tidy Data?

The concept of Tidy Data is better understood when demonstrated than when explained theoretically. To this end, in the course of this piece, I will use examples from real-life data to make the topic more interactive and, where possible, hands-on.

According to the father of Tidy Data, a dataset is messy if it does not meet the following guidelines:

- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table

### Messy or Untidy Data

Since the proof of the pudding is in the eating, let us look at an example of messy data from the real world.
Our dataset for this post is the record of the performance of private and public school students in the West African Examinations Council (WAEC) examination from 2016 to 2018.
This is what the dataset looks like when viewed in MS Excel:
![](my_icons/excel.png)



### Types of Messy Data
The first step to resolving messy data is to recognize it when you see it. Hadley Wickham explicitly lists five of the most common types of messy data:

- Column names are values, not variable names
- Multiple variables are stored in column names
- Variables are stored in both rows and columns
- Multiple types of observational units are stored in the same table
- A single observational unit is stored in multiple tables
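
The first item on this list is the one our WAEC sheets suffer from, so here is a minimal sketch of it. The states `A` and `B` and all the numbers are invented for illustration and are not taken from the WAEC file. The column names `Male` and `Female` are really values of a `Sex` variable, and `melt` moves them out of the header into a column of their own.

```python
import pandas as pd

# Invented data: two states with a male and a female percentage each.
messy = pd.DataFrame({
    "States": ["A", "B"],
    "Male": [10.0, 20.0],
    "Female": [30.0, 40.0],
})

# 'Male' and 'Female' are values of a hidden variable (Sex), so they are
# moved from the column names into a 'Sex' column.
tidy = messy.melt(id_vars="States", var_name="Sex", value_name="pct")

print(tidy)
```

After the reshape, every row is one observation (a state and a sex) and every column is one variable.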


### Tidy Data Procedure
Tidying data does not typically involve changing the values in a dataset, filling in missing values or doing any analysis. Tidying data consists of changing the **shape or structure of the data to meet the tidy data principles**.
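
One way to convince yourself that only the shape changes is a round trip: reshape a small table from wide to long and back again, then check that the values survived. This is only a sketch on invented numbers; the names `wide`, `long_form` and `back` are mine.

```python
import pandas as pd

# Invented wide table: one row per state, one column per sex.
wide = pd.DataFrame({
    "States": ["A", "B"],
    "Male": [10.0, 20.0],
    "Female": [30.0, 40.0],
})

# Wide to long: the shape changes from (2, 3) to (4, 3).
long_form = wide.melt(id_vars="States", var_name="Sex", value_name="pct")

# Long back to wide: pivot undoes the melt.
back = long_form.pivot(index="States", columns="Sex", values="pct").reset_index()

# Compare column by column; the column order may differ, the values may not.
same_values = back.to_dict("list") == wide.to_dict("list")

print(wide.shape, long_form.shape)
print(same_values)
```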


#### Read Data into Pandas
As is customary, Pandas is the go-to tool for cleaning, restructuring, wrangling and visualizing data. Our goal, as far as this dataset is concerned, is to get, for each year and each state, the performance of private and public school students of both sexes. To get our dataset from its present structure to that structure, we will need to do some restructuring and tidying. From the current dataset, we will extract the following variables (columns) and their values:

- States
- Year
- School_type
- Sex
- Five_credit_and_above: percentage of students that earned five credits and above


In [18]:
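# Make the packages installed in the conda environment importable (machine-specific path)
import sys
sys.path.extend([r"/Users/user/anaconda3/envs/wrangling_data/lib/python3.7/site-packages"])
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read each of the six sheets into its own data frame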
messy_data_private_2016 = pd.read_excel('./The-Data-Wrangling-Workshop/Chapter09/datasets/waec_2016_2018.xlsx', sheet_name='private 2016')
messy_data_public_2016 = pd.read_excel('./The-Data-Wrangling-Workshop/Chapter09/datasets/waec_2016_2018.xlsx', sheet_name='Public 2016')
messy_data_private_2017 = pd.read_excel('./The-Data-Wrangling-Workshop/Chapter09/datasets/waec_2016_2018.xlsx', sheet_name='Private 2017')
messy_data_public_2017 = pd.read_excel('./The-Data-Wrangling-Workshop/Chapter09/datasets/waec_2016_2018.xlsx', sheet_name='public 2017')
messy_data_private_2018 = pd.read_excel('./The-Data-Wrangling-Workshop/Chapter09/datasets/waec_2016_2018.xlsx', sheet_name='Private 2018')
messy_data_public_2018 = pd.read_excel('./The-Data-Wrangling-Workshop/Chapter09/datasets/waec_2016_2018.xlsx', sheet_name='Public 2018')
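
Notice that the six sheet names are not capitalized consistently ('private 2016' but 'Public 2016'). A small helper can turn any of them into a clean school type and year. The name `parse_sheet_name` is hypothetical, and the sketch assumes every sheet name has the form "type year" as in the calls above.

```python
def parse_sheet_name(sheet_name):
    """Split a sheet name such as 'Public 2016' into ('public', 2016)."""
    school_type, year = sheet_name.strip().rsplit(" ", 1)
    return school_type.lower(), int(year)


# The sheet names exactly as they are spelled in the workbook.
sheet_names = ['private 2016', 'Public 2016', 'Private 2017',
               'public 2017', 'Private 2018', 'Public 2018']

for name in sheet_names:
    print(name, '->', parse_sheet_name(name))
```

This pairs well with `pd.read_excel(..., sheet_name=None)`, which returns a dictionary mapping every sheet name to its data frame, so the six separate calls could become one call followed by a loop over that dictionary.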



In [19]:
messy_data_private_2016.head()

Unnamed: 0,WEST AFRICAN EXAMINATIONS COUNCIL,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15
0,"RESULT STATISTICS FOR 5 CREDITS IN ENGLISH, MA...",,,,,,,,,,,,,,,
1,WASSCE PRIVATE CANDIDATES EXAMINATION 2016,,,,,,,,,,,,,,,
2,STATE,TOTAL NUMBER SAT,,,5 CREDITS AND ABOVE INCLUDING ENGLISH LANGUAGE,,,5 CREDITS AND ABOVE INCLUDING MATHEMATICS,,,5 CREDITS & ABOVE INCLUDING MATHEMATICS & ENGL...,,,% OF 5 CREDITS AND ABOVE,,
3,,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL
4,ABIA,309,276,585,65,59,124,39,27,66,35,25,60,11.3269,9.05797,10.2564


In [20]:
messy_data_public_2016.head()

Unnamed: 0,WEST AFRICAN EXAMINATIONS COUNCIL,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15
0,"RESULT STATISTICS FOR 5 CREDITS IN ENGLISH, MA...",,,,,,,,,,,,,,,
1,WASSCE CANDIDATES EXAMINATION 2016,,,,,,,,,,,,,,,
2,STATE,TOTAL NUMBER SAT,,,5 CREDITS AND ABOVE INCLUDING ENGLISH LANGUAGE,,,5 CREDITS AND ABOVE INCLUDING MATHEMATICS,,,5 CREDITS & ABOVE INCLUDING MATHEMATICS & ENGL...,,,% OF 5 CREDITS & ABOVE INCLUDING MATHEMATICS &...,,
3,,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL
4,ABIA,24322,27405,51727,20403,23564,43967,18485,20688,39173,17264,19526,36790,70.981,71.2498,71.1234


In [21]:
messy_data_private_2017.head()

Unnamed: 0,WEST AFRICAN EXAMINATIONS COUNCIL,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15
0,"RESULT STATISTICS FOR 5 CREDITS IN ENGLISH ,M...",,,,,,,,,,,,,,,
1,WASSCE PRIVATE CANDIDATES EXAMINATION 2017,,,,,,,,,,,,,,,
2,STATE,TOTAL NUMBER SAT,,,5 CREDITS AND ABOVE INCLUDING ENGLISH LANGUAGE,,,5 CREDITS AND ABOVE INCLUDING MATHEMATICS,,,5 CREDITS & ABOVE INCLUDING MATHEMATICS & ENGL...,,,% OF 5 CREDITS & ABOVE INCLUDING MATHEMATICS &...,,
3,,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL
4,ABIA,331,279,610,128,116,244,106,105,211,96,99,195,29.003,35.4839,31.9672


In [22]:
messy_data_public_2017.head()

Unnamed: 0,WEST AFRICAN EXAMINATIONS COUNCIL,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15
0,"RESULT STATISTICS FOR 5 CREDITS IN ENGLISH ,MA...",,,,,,,,,,,,,,,
1,WASSCE CANDIDATES EXAMINATION 2017,,,,,,,,,,,,,,,
2,STATE,TOTAL NUMBER SAT,,,5 CREDITS AND ABOVE INCLUDING ENGLISH LANGUAGE,,,5 CREDITS AND ABOVE INCLUDING MATHEMATICS,,,5 CREDITS & ABOVE INCLUDING MATHEMATICS & ENGL...,,,% OF 5 CREDITS & ABOVE INCLUDING MATHEMATICS &...,,
3,,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL
4,ABIA,23649,26023,49672,19317,21712,41029,20727,22931,43658,17928,20084,38012,75.8087,77.1779,76.526


In [23]:
messy_data_private_2018.head()

Unnamed: 0,WEST AFRICAN EXAMINATIONS COUNCIL,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 16374,Unnamed: 16375,Unnamed: 16376,Unnamed: 16377,Unnamed: 16378,Unnamed: 16379,Unnamed: 16380,Unnamed: 16381,Unnamed: 16382,Unnamed: 16383
0,"RESULT STATISTICS FOR 5 CREDITS IN ENGLISH ,M...",,,,,,,,,,...,,,,,,,,,,
1,WASSCE PRIVATE CANDIDATES EXAMINATION 2018,,,,,,,,,,...,,,,,,,,,,
2,STATE,TOTAL NUMBER SAT,,,5 CREDITS AND ABOVE INCLUDING ENGLISH LANGUAGE,,,5 CREDITS AND ABOVE INCLUDING MATHEMATICS,,,...,,,,,,,,,,
3,,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,...,,,,,,,,,,
4,ABIA,308,291,599,176,176,352,172,170,342,...,,,,,,,,,,


In [24]:
messy_data_public_2018.head()

Unnamed: 0,WEST AFRICAN EXAMINATIONS COUNCIL,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15
0,"RESULT STATISTICS FOR 5 CREDITS IN ENGLISH ,M...",,,,,,,,,,,,,,,
1,WASSCE CANDIDATES EXAMINATION 2018,,,,,,,,,,,,,,,
2,STATE,TOTAL NUMBER SAT,,,5 CREDITS AND ABOVE INCLUDING ENGLISH LANGUAGE,,,5 CREDITS AND ABOVE INCLUDING MATHEMATICS,,,5 CREDITS & ABOVE INCLUDING MATHEMATICS & ENGL...,,,% of 5 CREDITS AND ABOVE INCLUDING MATHEMATICS...,,
3,,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL,MALE,FEMALE,TOTAL
4,ABIA,22502,24966,47468,19711,22056,41767,20239,22426,42665,18484,20572,39056,82.1438,82.4001,82.2786


#### Remove/skip Useless Values and Variables
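From the MS Excel view of the data above, we can see that the first three rows of every sheet hold titles rather than data, so we skip them. We then keep only the columns we need for our desired structure: the state name and the male and female percentages.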

In [25]:
#private 2016
skip_rows_private_2016 = pd.read_excel('./The-Data-Wrangling-Workshop/Chapter09/datasets/waec_2016_2018.xlsx',\
                                       sheet_name='private 2016',header=None,skiprows=[0,1,2])
rename_col = skip_rows_private_2016.iloc[2:39,[0, 13,14]]
rename_col = rename_col.rename(columns={0:'States', 13:'Male',14:'Female'})
rename_col['School_type'] = 'private'
rename_col['Year'] = 2016


#public 2016
skip_rows_public_2016 = pd.read_excel('./The-Data-Wrangling-Workshop/Chapter09/datasets/waec_2016_2018.xlsx',\
                                      sheet_name='Public 2016',header=None, skiprows=[0,1,2])
rename_col_2016_b = skip_rows_public_2016.iloc[2:39,[0,13,14]]
rename_col_2016_b = rename_col_2016_b.rename(columns={0:'States', 13:'Male', 14:'Female'})
rename_col_2016_b['School_type'] = 'public'
rename_col_2016_b['Year'] = 2016



#private 2017
skip_rows_private_2017 = pd.read_excel('./The-Data-Wrangling-Workshop/Chapter09/datasets/waec_2016_2018.xlsx',\
                                       sheet_name='Private 2017',header=None,skiprows=[0,1,2])
rename_col_2017 = skip_rows_private_2017.iloc[2:39,[0, 13,14]]
rename_col_2017 = rename_col_2017.rename(columns={0:'States', 13:'Male',14:'Female'})
rename_col_2017['School_type'] = 'private'
rename_col_2017['Year'] = 2017


#public 2017
skip_rows_public_2017 = pd.read_excel('./The-Data-Wrangling-Workshop/Chapter09/datasets/waec_2016_2018.xlsx',\
                                      sheet_name='public 2017',header=None, skiprows=[0,1,2])
rename_col_2017_public = skip_rows_public_2017.iloc[2:39,[0,13,14]]
rename_col_2017_public = rename_col_2017_public.rename(columns={0:'States', 13:'Male', 14:'Female'})
rename_col_2017_public['School_type'] = 'public'
rename_col_2017_public['Year'] = 2017



#private 2018
skip_rows_private_2018 = pd.read_excel('./The-Data-Wrangling-Workshop/Chapter09/datasets/waec_2016_2018.xlsx',\
                                       sheet_name='Private 2018',header=None,skiprows=[0,1,2])
rename_col_2018_private = skip_rows_private_2018.iloc[2:39,[0, 13,14]]
rename_col_2018_private = rename_col_2018_private.rename(columns={0:'States', 13:'Male',14:'Female'})
rename_col_2018_private['School_type'] = 'private'
rename_col_2018_private['Year'] = 2018


#public 2018
skip_rows_public_2018 = pd.read_excel('./The-Data-Wrangling-Workshop/Chapter09/datasets/waec_2016_2018.xlsx',\
                                      sheet_name='Public 2018',header=None, skiprows=[0,1,2])
rename_col_2018_public = skip_rows_public_2018.iloc[2:39,[0,13,14]]
rename_col_2018_public = rename_col_2018_public.rename(columns={0:'States', 13:'Male', 14:'Female'})
rename_col_2018_public['School_type'] = 'public'
rename_col_2018_public['Year'] = 2018
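
The six blocks above differ only in the sheet name, the school type and the year, so the repeated steps could be wrapped in a function. The sketch below is one possible way to do that. `extract_percentages` is a hypothetical name, the function assumes every sheet has the layout seen in the `head()` outputs (states starting at row 2 after skipping three rows, percentages in columns 13 and 14), and the stand-in `raw` frame uses invented values so the example runs without the workbook.

```python
import pandas as pd


def extract_percentages(raw, school_type, year):
    """Keep the state name and the male/female percentage columns.

    `raw` is a sheet read with header=None and skiprows=[0, 1, 2], so
    rows 0 and 1 hold the two header rows and the states start at row 2.
    """
    subset = raw.iloc[2:39, [0, 13, 14]].copy()
    subset.columns = ['States', 'Male', 'Female']
    subset['School_type'] = school_type
    subset['Year'] = year
    return subset


# Tiny stand-in for one sheet: 16 columns, two header rows, two states.
rows = [
    ['STATE'] + [None] * 15,
    [None] * 13 + ['MALE', 'FEMALE', 'TOTAL'],
    ['A'] + [None] * 12 + [10.0, 30.0, 20.0],
    ['B'] + [None] * 12 + [20.0, 40.0, 30.0],
]
raw = pd.DataFrame(rows)

result = extract_percentages(raw, 'private', 2016)
print(result)
```

With the real workbook, `raw` would come from `pd.read_excel` with `header=None` and `skiprows=[0,1,2]`, exactly as in the cell above.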









#### Transform to Tidy Data
At this point we have made significant progress: for each sheet we have a data frame holding, for every state, the percentage of males and females that earned five credits and above. Going by the tidy data guidelines, this data frame still needs some restructuring, because `Male` and `Female` are values of a `Sex` variable rather than variables in their own right. Since our goal is to merge the data in all the sheets into one table, i.e. one data frame, the other sheets would pass through the same procedure.

In [26]:
tidy_private_2016 = rename_col.melt(id_vars=['States'], value_vars=['Male', 'Female'],var_name='Sex',value_name='pct_five_credit_above')

In [27]:
tidy_private_2016

Unnamed: 0,States,Sex,pct_five_credit_above
0,ABIA,Male,11.3269
1,ABUJA,Male,6.53595
2,ADAMAWA,Male,6.42857
3,AKWA IBOM,Male,12.0308
4,ANAMBRA,Male,16.7679
...,...,...,...
69,RIVERS,Female,47.2096
70,SOKOTO,Female,12.9032
71,TARABA,Female,0
72,YOBE,Female,0


## Conclusion
From the above version of our dataset, we can see that it now follows the tidy data guidelines: each variable (state, sex, percentage) forms a column and each observation forms a row. Our dataset has thus been transformed from wide to long. In the course of this post we have demonstrated data cleaning, wrangling and restructuring. More can still be done to clean and fine-tune the dataset, such as repeating the reshape for the remaining sheets and combining them into one table. Nevertheless, we will conclude this post here, as we have achieved the objective of demonstrating how to tidy data using pandas. I hope you have learned something today.
