# 1) Understand the difference between vertical tables and horizontal tables.

In the database world, tables store data, and there are different ways to structure them, among which are the horizontal and vertical structures. A horizontal table splits the data across columns: each column holds one variable, so each column contains only part of the data. This layout is the more common one because it is easier to handle than other table structures. In some contexts, however, it causes problems. For example, if you deploy an app and then realize that it needs data that is not in the table, you have to add a new column to the dataframe, and you may have to restructure the whole deployment, which can be quite laborious.

Vertical tables, on the other hand, store both the variables and their values in rows, which makes it easy to add new data to the dataframe. This makes them a better option in dynamic contexts that work with a huge volume of data. The kind of table you should use therefore depends heavily on the context of your goal, so it is recommended to check carefully: What is the goal of your project? How dynamic is it? Based on these questions, use the data structure that best fits your project. In the classification context, Xie et al. (2013) found that horizontal representations can robustly improve performance.
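
To make the contrast concrete, here is a minimal sketch with hypothetical toy data (the `item_id`, `language`, and sample values are assumptions made up for illustration, not part of the fact-checking dataset). It shows that adding a variable changes the columns of a horizontal table but only adds rows to a vertical one.

```python
import pandas as pd

# Hypothetical toy data, not taken from the fact-checking dataset.
wide = pd.DataFrame(
    {"item_id": [1, 2], "author": ["Ann", "Bob"], "source": ["Blog", "TV"]}
)

# Horizontal layout: a new variable means a new column (a schema change).
wide_extended = wide.assign(language=["en", "pt"])

# Vertical layout: the same information as (item_id, variable, value) rows.
long = wide.melt(id_vars="item_id", var_name="variable", value_name="value")

# A new variable is just more rows; the three columns stay the same.
new_rows = pd.DataFrame(
    {"item_id": [1, 2], "variable": ["language", "language"], "value": ["en", "pt"]}
)
long_extended = pd.concat([long, new_rows], ignore_index=True)

print(wide_extended.columns.tolist())
print(long_extended.columns.tolist())
```

The horizontal table gains a fourth column, while the vertical table keeps its three columns and grows from four rows to six.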

## Reference

Xie, J.; Xu, B.; Chuang, Z. Horizontal and Vertical Ensemble with Deep Representation for Classification. Presented at the ICML Workshop on Representation Learning, Atlanta, Georgia, USA, 2013. https://arxiv.org/pdf/1306.2759.pdf



# 2) Find examples of vertical and horizontal tables

## About the data:

The data we're going to use were web scraped from the site "https://www.politifact.com/factchecks", which is dedicated to classifying political news as real, fake, mostly true, or mostly fake. The dataframe has the following columns: author, the author of the news item; statement, which is the news itself; source, the site where the news was published; date, the date on which the news was posted; target, the classification of the news; and BinaryNumTarget, the binary code used to classify the news as real or fake, with 1 representing fake news and 0 representing real news.

The code used to scrape the data can be found at the following link: https://github.com/Viniciusfcfranca/Web_Scrapping_Polifact/blob/main/WebScrapping_news_csv.ipynb


In [1]:
# Importing the necessary packages
import numpy as np
import pandas as pd

# Loading the scraped data
df = pd.read_csv("fact_checker.csv")

# Showing the first rows of the dataframe
df.head()

Unnamed: 0.1,Unnamed: 0,author,statement,source,date,target,BinaryNumTarget
0,0,Jeff Cercone,Washington’s State Board of Health will discus...,Facebook posts,"January 11, 2022",barely-true,1
1,1,Ciara O'Rourke,Says Supreme Court Justice Sonia Sotomayor was...,Facebook posts,"January 11, 2022",false,1
2,2,Madison Czopek,Walgreens refrigerators are scanning shoppers’...,Facebook posts,"January 11, 2022",pants-fire,1
3,3,Nina Baker,"Gov. Kim Reynolds, touting $210 million for Io...",Iowa Senate Democrats,"January 11, 2022",true,0
4,4,Samantha Putterman,“New Zealand okays euthanasia for COVID patien...,Bloggers,"January 11, 2022",barely-true,1


In [2]:
# Dropping the leftover index column and the multi-class target
df = df.drop(columns=['Unnamed: 0', 'target'])

In [3]:
df.head()

Unnamed: 0,author,statement,source,date,BinaryNumTarget
0,Jeff Cercone,Washington’s State Board of Health will discus...,Facebook posts,"January 11, 2022",1
1,Ciara O'Rourke,Says Supreme Court Justice Sonia Sotomayor was...,Facebook posts,"January 11, 2022",1
2,Madison Czopek,Walgreens refrigerators are scanning shoppers’...,Facebook posts,"January 11, 2022",1
3,Nina Baker,"Gov. Kim Reynolds, touting $210 million for Io...",Iowa Senate Democrats,"January 11, 2022",0
4,Samantha Putterman,“New Zealand okays euthanasia for COVID patien...,Bloggers,"January 11, 2022",1


This dataframe is an example of a horizontal dataframe, where the data is split across columns and each column represents part of the total data. Now let's see what a vertical dataframe would look like.

In [4]:
# Using the stack function on the original dataframe;
# reset_index already returns a DataFrame, so no extra conversion is needed
df2 = df.set_index('BinaryNumTarget').stack().reset_index(name='values')
df2

Unnamed: 0,BinaryNumTarget,level_1,values
0,1,author,Jeff Cercone
1,1,statement,Washington’s State Board of Health will discus...
2,1,source,Facebook posts
3,1,date,"January 11, 2022"
4,1,author,Ciara O'Rourke
...,...,...,...
9355,1,date,"February 19, 2021"
9356,0,author,Amy Sherman
9357,0,statement,"“Energy experts and State House Dems, among ot..."
9358,0,source,Beto O'Rourke


In [5]:
df2 = df2.rename(columns={'level_1': 'variables'})

In [6]:
df2

Unnamed: 0,BinaryNumTarget,variables,values
0,1,author,Jeff Cercone
1,1,statement,Washington’s State Board of Health will discus...
2,1,source,Facebook posts
3,1,date,"January 11, 2022"
4,1,author,Ciara O'Rourke
...,...,...,...
9355,1,date,"February 19, 2021"
9356,0,author,Amy Sherman
9357,0,statement,"“Energy experts and State House Dems, among ot..."
9358,0,source,Beto O'Rourke
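
In the vertical table above, rows that came from different news items share the same `BinaryNumTarget` value, so nothing says which rows belong together. The sketch below is one possible way to keep that link. The `record_id` name and the toy rows are assumptions for illustration and are not in the original notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for the fact-checking data.
toy = pd.DataFrame(
    {
        "author": ["Ann", "Bob"],
        "source": ["Blog", "TV"],
        "BinaryNumTarget": [1, 0],
    }
)

# Keep the row position as an explicit record identifier before stacking.
toy_long = (
    toy.rename_axis("record_id")
    .set_index("BinaryNumTarget", append=True)
    .stack()
    .rename_axis(["record_id", "BinaryNumTarget", "variables"])
    .reset_index(name="values")
)

# With record_id present, the horizontal layout can be rebuilt.
toy_wide = toy_long.pivot(
    index=["record_id", "BinaryNumTarget"], columns="variables", values="values"
).reset_index()

print(toy_long)
print(toy_wide)
```

Without an identifier like this, the vertical rows cannot be reliably grouped back into one row per news item.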


# 3) Choose a categorical variable in the vertical table and use the 'transpose' function in the 'Pandas' package to transpose the table and create a contingency table.

In [7]:
df2_transposed = df2.set_index('variables').transpose()
df2_transposed

variables,author,statement,source,date,author.1,statement.1,source.1,date.1,author.2,statement.2,...,source.2,date.2,author.3,statement.3,source.3,date.3,author.4,statement.4,source.4,date.4
BinaryNumTarget,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,0,0,0,0
values,Jeff Cercone,Washington’s State Board of Health will discus...,Facebook posts,"January 11, 2022",Ciara O'Rourke,Says Supreme Court Justice Sonia Sotomayor was...,Facebook posts,"January 11, 2022",Madison Czopek,Walgreens refrigerators are scanning shoppers’...,...,Viral image,"February 19, 2021",Madison Czopek,"“If you make $50,000/year, $36 of your taxes g...",Facebook posts,"February 19, 2021",Amy Sherman,"“Energy experts and State House Dems, among ot...",Beto O'Rourke,"February 19, 2021"
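
A small sketch on hypothetical toy rows (the values are made up for illustration) may help show what the transpose does here: every row of the vertical table becomes a column, which is why the result above is so wide.

```python
import pandas as pd

# Hypothetical two-row vertical table with the same column names as df2.
toy_long = pd.DataFrame(
    {
        "BinaryNumTarget": [1, 1],
        "variables": ["author", "source"],
        "values": ["Ann", "Blog"],
    }
)

# Rows become columns: one column per original row, labelled by 'variables'.
toy_transposed = toy_long.set_index("variables").transpose()
print(toy_transposed)
```

The transposed frame has one row for `BinaryNumTarget` and one for `values`, and as many columns as the vertical table had rows.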


# 4) Understand the concept of contingency table.

In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering, and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them.
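
As an illustration of the definition above, the following sketch builds a contingency table with `pd.crosstab` on hypothetical toy rows (the counts are made up and do not come from the real dataset). It cross-tabulates a categorical variable, `source`, against `BinaryNumTarget`.

```python
import pandas as pd

# Hypothetical toy rows; the real dataset has many more sources.
toy = pd.DataFrame(
    {
        "source": [
            "Facebook posts",
            "Facebook posts",
            "Bloggers",
            "Bloggers",
            "Facebook posts",
        ],
        "BinaryNumTarget": [1, 1, 1, 0, 0],
    }
)

# Rows are sources, columns are target classes, cells are counts.
# margins=True adds the row and column totals labelled 'All'.
table = pd.crosstab(toy["source"], toy["BinaryNumTarget"], margins=True)
print(table)
```

The same call should work on the horizontal dataframe from section 2, for example `pd.crosstab(df["source"], df["BinaryNumTarget"])`.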

## Reference

Karl Pearson, F.R.S. (1904). Mathematical contributions to the theory of evolution. Dulau and Co.: https://archive.org/details/cu31924003064833

# 5) Study the melt function and apply it to the data.

## Melt

The melt function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are "unpivoted" to the row axis, leaving just two non-identifier columns, 'variable' and 'value'.
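
A minimal sketch on hypothetical toy rows (the values are assumptions for illustration) shows the roles of `id_vars` and `value_vars`, plus the optional `var_name` and `value_name` parameters that rename the two output columns.

```python
import pandas as pd

# Hypothetical toy rows with a subset of the dataset's columns.
toy = pd.DataFrame(
    {
        "author": ["Ann", "Bob"],
        "source": ["Blog", "TV"],
        "BinaryNumTarget": [1, 0],
    }
)

# BinaryNumTarget is kept as the identifier; author and source are unpivoted.
melted = toy.melt(
    id_vars=["BinaryNumTarget"],
    value_vars=["author", "source"],
    var_name="variables",
    value_name="values",
)
print(melted)
```

Note that melt unpivots column by column (all authors first, then all sources), whereas stack goes row by row, so the two vertical tables hold the same values in a different order.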

In [8]:
df.melt(id_vars=['BinaryNumTarget'])

Unnamed: 0,BinaryNumTarget,variable,value
0,1,author,Jeff Cercone
1,1,author,Ciara O'Rourke
2,1,author,Madison Czopek
3,0,author,Nina Baker
4,1,author,Samantha Putterman
...,...,...,...
9355,1,date,"February 19, 2021"
9356,1,date,"February 19, 2021"
9357,1,date,"February 19, 2021"
9358,1,date,"February 19, 2021"


In [9]:
df2.melt(id_vars=['variables'], value_vars=['values'])

Unnamed: 0,variables,variable,value
0,author,values,Jeff Cercone
1,statement,values,Washington’s State Board of Health will discus...
2,source,values,Facebook posts
3,date,values,"January 11, 2022"
4,author,values,Ciara O'Rourke
...,...,...,...
9355,date,values,"February 19, 2021"
9356,author,values,Amy Sherman
9357,statement,values,"“Energy experts and State House Dems, among ot..."
9358,source,values,Beto O'Rourke


# 6) Use the transpose function in SQL to achieve the same result as the Pandas transpose.

### First, let's turn this dataframe into a SQL table: we convert it into an Excel file and then import it into MySQL.

In [10]:
df2.to_excel('fact_checker.xlsx')

Next, we import it into a SQL schema.

In MySQL, we selected one of our existing schemas, right-clicked the Tables section, and clicked 'create new table'. This opened a new window where we could create the table and name its columns. We named the table 'fact_checker', gave its columns the same names as the columns in the dataframe, and applied the changes to create the table.

After this, we right-clicked the new table and chose the "table data import wizard" option to import the file into MySQL, which filled the table with data.

After that, I used the following query in MySQL:

SELECT  "values",
        MAX(CASE WHEN BinaryNumTarget = 0 THEN "variables" END) AS 'author',
        MAX(CASE WHEN BinaryNumTarget = 0 THEN "variables" END) AS 'statement',
        MAX(CASE WHEN BinaryNumTarget = 0 THEN "variables" END) AS 'source',
        MAX(CASE WHEN BinaryNumTarget = 0 THEN "variables" END) AS 'date',
        MAX(CASE WHEN BinaryNumTarget = 1 THEN "variables" END) AS 'author',
        MAX(CASE WHEN BinaryNumTarget = 1 THEN "variables" END) AS 'statement',
        MAX(CASE WHEN BinaryNumTarget = 1 THEN "variables" END) AS 'source',
        MAX(CASE WHEN BinaryNumTarget = 1 THEN "variables" END) AS 'date'
FROM    world.fact_checker
ORDER   BY "values";

This returned:



| variables | author | statement | source | date | author | ... | source | date |
|--- |--- |--- |--- |--- |--- |--- |--- |--- |
| BinaryNumTarget| 1 | 1 | 1 | 1 |1 | ... | 1 | 0 | 
| values | Jeff Cercone | Washington’s State Board of Health will discus... | Facebook posts | January 11, 2022 |Ciara O'Rourke | ...| Facebook posts | January 11, 2022 | 
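
A common way to turn a vertical table back into a horizontal one in SQL is conditional aggregation, which needs a key that identifies each original record. The sketch below illustrates the idea under stated assumptions: it uses Python's built-in `sqlite3` module instead of MySQL so that it runs without a server, the toy rows are made up, and `record_id` is a hypothetical column that the imported table does not necessarily have. Backticks quote the `variables` and `values` identifiers, as MySQL does by default, since `values` is an SQL keyword.

```python
import sqlite3

# Hypothetical vertical rows; record_id is an assumed column that tells
# which news item each (variables, values) pair belongs to.
rows = [
    (0, 1, "author", "Ann"),
    (0, 1, "source", "Blog"),
    (1, 0, "author", "Bob"),
    (1, 0, "source", "TV"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_checker "
    "(record_id INTEGER, BinaryNumTarget INTEGER, `variables` TEXT, `values` TEXT)"
)
conn.executemany("INSERT INTO fact_checker VALUES (?, ?, ?, ?)", rows)

# One output column per variable; MAX picks the single non-NULL value
# that the CASE expression produces within each record.
query = """
SELECT record_id,
       BinaryNumTarget,
       MAX(CASE WHEN `variables` = 'author' THEN `values` END) AS author,
       MAX(CASE WHEN `variables` = 'source' THEN `values` END) AS source
FROM fact_checker
GROUP BY record_id, BinaryNumTarget
ORDER BY record_id
"""
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

Each CASE expression tests the `variables` column and returns the matching entry from `values`, and the GROUP BY collapses the rows of one record into a single output row.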
