# Exploratory Data Analysis -> Factored *Datathon* 2023
## *Raw Data*

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves examining and summarizing data to gain insights and understanding before formal modeling or hypothesis testing. It is essential for several reasons: 

- Data Understanding
- Data Cleaning
- Feature Selection
- Pattern Recognition
- Outlier Detection

This document summarizes the EDA we carried out over the course of the datathon: which questions were asked, and how those questions led to the business and product decisions that shaped the final product.

In [1]:
# Import seaborn and pandas
import seaborn as sns
import pandas as pd

# Apply the default theme
sns.set_theme()

In [3]:
%%bigquery general_df

SELECT * FROM `plenary-stacker-393921.factored.reviews`  LIMIT 1000

In [6]:
general_df.columns

Index(['asin', 'overall', 'reviewText', 'reviewerID', 'reviewerName',
       'summary', 'unixReviewTime', 'verified', 'style', 'vote', 'image',
       'batch', 'batch_file'],
      dtype='object')

In [8]:
%%bigquery meta_df

SELECT * FROM `plenary-stacker-393921.factored.metadata`  LIMIT 1000

In [9]:
meta_df.columns

Index(['asin', 'also_buy', 'also_view', 'brand', 'category', 'date',
       'description', 'details', 'feature', 'fit', 'image', 'main_cat',
       'price', 'rank', 'similar_item', 'tech1', 'tech2', 'title', 'batch',
       'batch_file'],
      dtype='object')

## What was the data?
#### General Description

The challenge consisted of two datasets of unequal size: one contains Amazon reviews of different products, and the other contains the metadata associated with each of those products. Most of the data was received in batches, but a significant chunk arrived via streaming. The streamed data was largely similar to the batch data.

#### How data was received and organized

Data arrived as nested JSON, which we flattened into tabular objects: the keys became the main columns, and the values became the information stored in the rows. Our team ingested the data into BigQuery, Google Cloud's data warehouse, where we created structures suitable for ML training and pattern recognition. We kept the division between metadata and review data, but we did run analyses linking both domains.
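
The flattening step described above can be sketched with `pandas.json_normalize`. This is only an illustration, not the pipeline the team actually ran: the `records` list is made up, and the nested `style` payload is an assumption about what the raw JSON might look like.

```python
import pandas as pd

# Hypothetical nested records (invented for illustration, not real data).
records = [
    {
        "asin": "B000000001",
        "overall": 5.0,
        "reviewerID": "A1",
        "style": {"Size": "M", "Color": "Blue"},
    },
    {
        "asin": "B000000002",
        "overall": 3.0,
        "reviewerID": "A2",
        "style": {"Size": "L"},
    },
]

# Top-level keys become columns; nested dictionaries are expanded into
# "parent_child" columns because of sep="_". Missing nested keys become NaN.
flat_df = pd.json_normalize(records, sep="_")

print(flat_df.columns.tolist())
print(flat_df)
```

A flat frame like this maps directly onto a table with one column per key, which is the shape the rest of this notebook works with.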


##### Reviews

Our organized data had the following columns:

- 'asin': Unique ID of the object reviewed.
- 'overall': Rating that the reviewer left in the review.
- 'reviewText': Text of the review.
- 'reviewerID': Reviewer ID.
- 'reviewerName': Self-declared name of the reviewer.
- 'summary': Title of the review.
- 'unixReviewTime': Time of the review, as a Unix timestamp.
- 'verified': Whether Amazon has verified that the reviewer is a real person.
- 'style': Kind of review.
- 'vote': Votes weighting the quality of the review.
- 'image': Link to the image. All values were null.
- 'batch': Epoch at which the data was requested.
- 'batch_file': File from which the record was retrieved.


##### Uniqueness

To create a unique key for analysis, three fields have to be concatenated: *ASIN + reviewerID + unixReviewTime*. Why?
- A product can have multiple reviews by the same person.
- A person can leave multiple reviews on the same product, so only the review time tells them apart.
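
As a minimal sketch of the uniqueness rule above, the snippet below builds the composite key in pandas and checks it. The rows are toy values invented for the example, and the `review_key` column name is hypothetical.

```python
import pandas as pd

# Toy rows (illustrative only): one reviewer reviews the same product twice.
reviews = pd.DataFrame(
    {
        "asin": ["B01", "B01", "B02"],
        "reviewerID": ["A1", "A1", "A1"],
        "unixReviewTime": [1500000000, 1600000000, 1500000000],
    }
)

# Concatenate the three fields into a single key, using "_" as a separator.
reviews["review_key"] = (
    reviews["asin"]
    + "_"
    + reviews["reviewerID"]
    + "_"
    + reviews["unixReviewTime"].astype(str)
)

# asin + reviewerID alone repeats in this toy data; the full key does not.
pair_is_unique = not reviews.duplicated(subset=["asin", "reviewerID"]).any()
key_is_unique = reviews["review_key"].is_unique

print("asin + reviewerID unique:", pair_is_unique)
print("full key unique:", key_is_unique)
```

Running the same check against the real table would show whether any exact duplicates remain even with the three-part key.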


In [4]:
%bigquery_stats plenary-stacker-393921.factored.reviews


##### Metadata

Our organized data had the following columns:

- 'asin': Unique ID of the object reviewed.
- 'also_buy': Other items that are bundled with the object.
- 'also_view': Other items that are often viewed.
- 'brand': Brand of the object.
- 'category': Business vertical of the object.
- 'date': Date of creation of the description.
- 'description': Description of the object.
- 'details': Object elements.
- 'feature': Special characteristic of the object.
- 'fit': HTML feature of the object.
- 'image': URL of the image.
- 'main_cat': Main category of the object on Amazon.
- 'price': Retail price.
- 'rank': Sales rank within the main category.
- 'similar_item': Similar items.
- 'title': Name of the object.
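
Both tables share the `asin` column, so analyses that link reviews to metadata can be sketched as a join on that key. The frames below are invented stand-ins for `general_df` and `meta_df`, and the snippet assumes each `asin` appears only once in the metadata, which the real table may not satisfy.

```python
import pandas as pd

# Invented stand-ins for the reviews and metadata samples.
reviews = pd.DataFrame(
    {"asin": ["B01", "B01", "B03"], "overall": [5.0, 3.0, 4.0]}
)
metadata = pd.DataFrame(
    {"asin": ["B01", "B02"], "main_cat": ["Fashion", "Books"], "price": [19.99, 7.5]}
)

# Left join keeps every review; validate raises if an asin is duplicated
# in the metadata, and indicator adds a "_merge" column showing match status.
linked = reviews.merge(
    metadata, on="asin", how="left", validate="many_to_one", indicator=True
)

# Reviews whose product has no metadata row.
unmatched = int((linked["_merge"] == "left_only").sum())

# Example of a cross-domain question: average rating per main category.
avg_rating = linked.groupby("main_cat")["overall"].mean()

print("reviews without metadata:", unmatched)
print(avg_rating)
```

If the `validate` check fails on the real data, deduplicating the metadata on `asin` first would be one option to consider.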


In [11]:
%bigquery_stats   plenary-stacker-393921.factored.meta_flat


### Conclusions

- The data is massive, and using all of it is not worthwhile in a time-constrained challenge.
- There are many formatting issues in each column, so exploring the columns one by one would not be efficient.
- Fashion is a giant category and the Pareto category, so it is the one chosen for the hackathon challenges.
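
One way to check which category dominates is to compute each category's share of rows and the cumulative share. The sketch below uses made-up counts, so its numbers are not results from the datathon data; it only assumes that `main_cat` is the column used to group products.

```python
import pandas as pd

# Made-up category labels (illustrative only, not the real distribution).
meta = pd.DataFrame(
    {"main_cat": ["Fashion"] * 6 + ["Books"] * 2 + ["Toys"] + ["Garden"]}
)

# Share of rows per category, sorted from largest to smallest.
share = meta["main_cat"].value_counts(normalize=True)

# The cumulative share shows how few categories cover most of the data.
summary = pd.DataFrame({"share": share, "cumulative_share": share.cumsum()})

print(summary)
```

On the real metadata, the same table could be computed in BigQuery with a `GROUP BY main_cat` query rather than on a 1000-row sample.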