## Challenge 5: Exploratory Data Analysis

### 1. Objectives:
    - Practice reading JSON files with pandas
    - Practice asking questions about the datasets we have
 
---
    
### 2. Walkthrough:

Let's practice exploring datasets and asking questions about them.

We have a dataset in JSON format stored at '../../Datasets/new_york_times_bestsellers-clean.json' (the /Datasets folder in the module's root directory).

First of all, read the JSON file and create a `DataFrame` from it:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

#/content/drive/MyDrive/DataAnalysis/Datasets/new_york_times_bestsellers-clean.json

Mounted at /content/drive


In [3]:
## Imports
import json

import pandas as pd

## Read the JSON file (the context manager closes it automatically)
path = '/content/drive/MyDrive/DataAnalysis/Datasets/new_york_times_bestsellers-clean.json'
with open(path, 'r') as f:
    json_data = json.load(f)

## Create the DataFrame
df = pd.DataFrame.from_dict(json_data)

df.head(5)

Unnamed: 0,amazon_product_url,author,description,publisher,title,oid,bestsellers_date.numberLong,published_date.numberLong,rank.numberInt,rank_last_week.numberInt,weeks_on_list.numberInt,price.numberDouble
0,http://www.amazon.com/The-Host-Novel-Stephenie...,Stephenie Meyer,Aliens have taken control of the minds and bod...,"Little, Brown",THE HOST,5b4aa4ead3089013507db18c,1211587200000,1212883200000,2,1,3,25.99
1,http://www.amazon.com/Love-Youre-With-Emily-Gi...,Emily Giffin,A woman's happy marriage is shaken when she en...,St. Martin's,LOVE THE ONE YOU'RE WITH,5b4aa4ead3089013507db18d,1211587200000,1212883200000,3,2,2,24.95
2,http://www.amazon.com/The-Front-Garano-Patrici...,Patricia Cornwell,A Massachusetts state investigator and his tea...,Putnam,THE FRONT,5b4aa4ead3089013507db18e,1211587200000,1212883200000,4,0,1,22.95
3,http://www.amazon.com/Snuff-Chuck-Palahniuk/dp...,Chuck Palahniuk,An aging porn queens aims to cap her career by...,Doubleday,SNUFF,5b4aa4ead3089013507db18f,1211587200000,1212883200000,5,0,1,24.95
4,http://www.amazon.com/Sundays-at-Tiffanys-Jame...,James Patterson and Gabrielle Charbonnet,A woman finds an unexpected love,"Little, Brown",SUNDAYS AT TIFFANY’S,5b4aa4ead3089013507db190,1211587200000,1212883200000,6,3,4,24.99
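
As a minimal sketch of the loading step, the snippet below builds a `DataFrame` from a tiny in-memory JSON string. The string `raw`, its column-to-rows layout, and the name `mini_df` are assumptions made for illustration; the real file may be laid out differently.

```python
import io
import json

import pandas as pd

# Hypothetical miniature of the file: each column maps row labels to values.
raw = '{"title": {"0": "THE HOST", "1": "SNUFF"}, "price": {"0": 25.99, "1": 24.95}}'

# io.StringIO stands in for open(path, 'r') so the example needs no file on disk.
with io.StringIO(raw) as f:
    json_data = json.load(f)

mini_df = pd.DataFrame.from_dict(json_data)
print(mini_df)
```

With this layout the row labels come from the JSON keys, so they are strings rather than integers.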


Now, using all the tools we have learned in this session (row and column indexing, `shape`, `dtypes`, `head`, `tail`, `columns`, `info`, etc.), explore your dataset and discuss the following questions with the expert and your classmates:

#### 1: What can we learn about this dataset just by reading the file name and the column names?

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3033 entries, 0 to 3032
Data columns (total 12 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   amazon_product_url           3033 non-null   object 
 1   author                       3033 non-null   object 
 2   description                  3033 non-null   object 
 3   publisher                    3033 non-null   object 
 4   title                        3033 non-null   object 
 5   oid                          3033 non-null   object 
 6   bestsellers_date.numberLong  3033 non-null   int64  
 7   published_date.numberLong    3033 non-null   int64  
 8   rank.numberInt               3033 non-null   int64  
 9   rank_last_week.numberInt     3033 non-null   int64  
 10  weeks_on_list.numberInt      3033 non-null   int64  
 11  price.numberDouble           3033 non-null   float64
dtypes: float64(1), int64(5), object(6)
memory usage: 308.0+ KB


#### 2: What is the size (shape) of our DataFrame? Would we consider it a large or a small dataset?

In [5]:
df.shape

(3033, 12)

#### 3: Do you think the column names are descriptive enough? Could they be clearer or cleaner?

In [6]:
df.columns

Index(['amazon_product_url', 'author', 'description', 'publisher', 'title',
       'oid', 'bestsellers_date.numberLong', 'published_date.numberLong',
       'rank.numberInt', 'rank_last_week.numberInt', 'weeks_on_list.numberInt',
       'price.numberDouble'],
      dtype='object')
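
One possible cleanup, sketched here on a one-row stand-in, is to drop the type suffixes (`.numberInt`, `.numberLong`, `.numberDouble`) from the column names. The frame `demo` is a hypothetical sample, and keeping only the text before the first dot is an assumption about what "cleaner" means here.

```python
import pandas as pd

cols = ['title', 'rank.numberInt', 'price.numberDouble', 'published_date.numberLong']
demo = pd.DataFrame([['THE HOST', 2, 25.99, 1212883200000]], columns=cols)

# Keep only the part before the first dot: 'rank.numberInt' -> 'rank'.
clean = demo.rename(columns=lambda name: name.split('.')[0])
print(list(clean.columns))
```

`rename` returns a new `DataFrame`, so the original column names stay available in `demo`.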

#### 4: What data types do we have?

In [7]:
df.dtypes

amazon_product_url              object
author                          object
description                     object
publisher                       object
title                           object
oid                             object
bestsellers_date.numberLong      int64
published_date.numberLong        int64
rank.numberInt                   int64
rank_last_week.numberInt         int64
weeks_on_list.numberInt          int64
price.numberDouble             float64
dtype: object
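
To separate numeric from non-numeric columns programmatically, `select_dtypes` is one option. This is a small sketch on a hypothetical two-row frame named `demo`, not on the full dataset.

```python
import pandas as pd

demo = pd.DataFrame({
    'title': ['THE HOST', 'SNUFF'],
    'rank.numberInt': [2, 5],
    'price.numberDouble': [25.99, 24.95],
})

# Integer and float columns count as 'number'; text columns do not.
numeric_cols = demo.select_dtypes(include='number').columns.tolist()
text_cols = demo.select_dtypes(exclude='number').columns.tolist()
print(numeric_cols, text_cols)
```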

#### 5: What does the format of our dates mean?

In [8]:

pd.to_datetime(df['published_date.numberLong'], unit='ms')


0      2008-06-08
1      2008-06-08
2      2008-06-08
3      2008-06-08
4      2008-06-08
          ...    
3028   2013-05-05
3029   2013-05-05
3030   2013-05-05
3031   2013-05-05
3032   2013-05-05
Name: published_date.numberLong, Length: 3033, dtype: datetime64[ns]
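
The integers are consistent with milliseconds elapsed since the Unix epoch (1970-01-01 UTC). The sketch below converts the two date values from the first row of the dataset; the variable names are illustrative.

```python
import pandas as pd

# bestsellers_date and published_date of the first row, in milliseconds.
ms = pd.Series([1211587200000, 1212883200000])

dates = pd.to_datetime(ms, unit='ms')
formatted = dates.dt.strftime('%Y-%m-%d').tolist()
gap_days = (dates[1] - dates[0]).days

print(formatted)  # → ['2008-05-24', '2008-06-08']
print(gap_days)   # → 15
```

For this row, the published date falls 15 days after the bestsellers date.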

#### 6: What kinds of questions could we answer using the numeric data in this dataset?


a) Best-ranked author (a lower mean rank is better, so with this descending sort the best-ranked authors appear at the bottom)

In [9]:
porautor = df[['author', 'rank.numberInt']]
porautor = porautor.groupby(porautor['author']).mean()
porautor.sort_values(by=['rank.numberInt', 'author'], ascending=False)

Unnamed: 0_level_0,rank.numberInt
author,Unnamed: 1_level_1
Ron Rash,16.000000
Raymond E Feist,16.000000
Maggie Shipstead,16.000000
Kimberly McCreight,16.000000
Jennifer Crusie,16.000000
...,...
Pat Conroy,4.666667
James Patterson and David Ellis,4.000000
Dan Brown,3.896552
Patrick Rothfuss,3.000000
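
Since rank 1 is the top of the list, an ascending sort puts the best mean rank first. The sketch below uses hypothetical authors `A`, `B`, and `C` to show the idea.

```python
import pandas as pd

# Hypothetical data: author A appears twice, B and C once each.
demo = pd.DataFrame({
    'author': ['A', 'A', 'B', 'C'],
    'rank.numberInt': [1, 3, 5, 4],
})

# sort_values() is ascending by default, so the lowest mean rank comes first.
mean_rank = demo.groupby('author')['rank.numberInt'].mean().sort_values()
print(mean_rank)
```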


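b) Book with the most time on the list

Note: `weeks_on_list.numberInt` is a running count (for example, THE FRONT shows 1, 2, 3 in consecutive weeks), so the sums in this output overstate the number of weeks each book spent on the list.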

In [10]:
librosemanas = df[['title', 'weeks_on_list.numberInt']]
librosemanas = librosemanas.groupby(librosemanas['title']).sum()
librosemanas.sort_values(by=['weeks_on_list.numberInt'], ascending=False)

Unnamed: 0_level_0,weeks_on_list.numberInt
title,Unnamed: 1_level_1
THE HELP,5886
THE GIRL WHO KICKED THE HORNET’S NEST,3160
THE HOST,1767
THE STORY OF EDGAR SAWTELLE,780
THE LOST SYMBOL,435
...,...
THE LAST SURGEON,1
THE LAST THRESHOLD,1
THE LAUGHTER OF DEAD KINGS,1
THE LAW OF NINES,1
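
Because `weeks_on_list.numberInt` behaves like a running count, taking the maximum per title is arguably closer to "weeks on the list" than the sum. The sketch below compares both on the THE FRONT and SCARPETTA values that appear in this dataset; the frame `demo` is a reduced stand-in.

```python
import pandas as pd

demo = pd.DataFrame({
    'title': ['THE FRONT', 'THE FRONT', 'THE FRONT', 'SCARPETTA', 'SCARPETTA'],
    'weeks_on_list.numberInt': [1, 2, 3, 1, 2],
})

weeks = demo.groupby('title')['weeks_on_list.numberInt']
summed = weeks.sum()   # adds the running counts together
longest = weeks.max()  # last value of the running count

print(summed['THE FRONT'], longest['THE FRONT'])  # → 6 3
```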


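c) Most expensive book

Note: `max()` is computed for each column independently, so `ZOO` is the alphabetically last title and 34.99 is the highest price; the two values need not belong to the same book.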

In [11]:
librosporprecio = df[['title', 'price.numberDouble']].max()
librosporprecio

title                   ZOO
price.numberDouble    34.99
dtype: object
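
To retrieve the row that actually holds the highest price, `idxmax` can locate it first. In this sketch the price assigned to `ZOO` is a made-up value, chosen so that the column-wise maximum and the row-wise answer differ.

```python
import pandas as pd

# The ZOO price is hypothetical; the other two rows come from the dataset.
demo = pd.DataFrame({
    'title': ['ZOO', 'THE HOST', 'SNUFF'],
    'price.numberDouble': [20.00, 25.99, 24.95],
})

colwise = demo.max()  # maximum of each column, taken separately
top = demo.loc[demo['price.numberDouble'].idxmax()]  # one complete row

print(colwise['title'], top['title'])
```

Here `colwise` pairs `ZOO` with 25.99, while `top` returns THE HOST, the book that carries that price.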

#### 7: What kinds of questions could we answer using the non-numeric data?

a) Book with the longest title

In [12]:
titulos = df['title'].to_frame()
titulos['len'] = titulos['title'].str.len()

titulos.sort_values(by=['len'], ascending=False)

Unnamed: 0,title,len
1908,THE LOST FLEET. BEYOND THE FRONTIER: DREADNAUGHT,48
2493,THE LIMPOPO ACADEMY OF PRIVATE DETECTION,40
2451,THE LIMPOPO ACADEMY OF PRIVATE DETECTION,40
2480,THE LIMPOPO ACADEMY OF PRIVATE DETECTION,40
2508,THE LIMPOPO ACADEMY OF PRIVATE DETECTION,40
...,...,...
2680,NW,2
2569,XO,2
3008,Z,1
2998,Z,1
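
The same title shows up once per week on the list, so the ranking by length repeats entries. One way to avoid that, sketched on a hypothetical four-item `Series`, is to drop duplicates before measuring.

```python
import pandas as pd

titles = pd.Series(['THE HOST', 'Z', 'THE HOST', 'NW'], name='title')

# Measure each distinct title once, longest first.
lengths = titles.drop_duplicates().str.len().sort_values(ascending=False)
print(lengths.tolist())  # → [8, 2, 1]
```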


b) Books with "X" in the title

In [13]:
busca = 'LORD'
df[df['title'].str.contains(busca)].head(5)

Unnamed: 0,amazon_product_url,author,description,publisher,title,oid,bestsellers_date.numberLong,published_date.numberLong,rank.numberInt,rank_last_week.numberInt,weeks_on_list.numberInt,price.numberDouble
984,http://www.amazon.com/First-Lords-Fury-Codex-A...,Jim Butcher,"With their survival at stake, Alerans prepare ...",Ace,FIRST LORD’S FURY,5b4aa4ead3089013507db7bd,1259366400000,1260662400000,7,0,1,25.95
1507,http://www.amazon.com/Warlord-New-Alex-Hawke-N...,Ted Bell,The counterspy Alex Hawke races to stop a madm...,Morrow/HarperCollins,WARLORD,5b4aa4ead3089013507dbb07,1284854400000,1286064000000,9,0,1,27.99
1523,http://www.amazon.com/Warlord-New-Alex-Hawke-N...,Ted Bell,The counterspy Alex Hawke races to stop a madm...,Morrow/HarperCollins,WARLORD,5b4aa4ead3089013507dbb1f,1285459200000,1286668800000,13,9,2,27.99
2683,http://www.amazon.com/Lord-Mountains-Novel-Cha...,S M Stirling,"Further adventures in a postapocalyptic America,",Roc,LORD OF MOUNTAINS,5b4aa4ead3089013507dc317,1347062400000,1348358400000,13,0,1,27.95
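
By default `str.contains` treats the pattern as a regular expression and matches substrings, which is why WARLORD appears in a search for `LORD`. The sketch below shows a literal case-insensitive search and a whole-word search; the title `Lord Jim` is a hypothetical addition used only to show case handling.

```python
import pandas as pd

# 'Lord Jim' is hypothetical; the other titles appear in the dataset.
titles = pd.Series(['WARLORD', 'LORD OF MOUNTAINS', 'THE HOST', 'Lord Jim'])

# Literal, case-insensitive substring search.
substring = titles[titles.str.contains('lord', case=False, regex=False)]

# Whole-word, case-sensitive search using a regular expression.
whole_word = titles[titles.str.contains(r'\bLORD\b')]

print(substring.tolist())
print(whole_word.tolist())
```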


c) Books by author "X"

In [14]:
autor = 'Patricia'
df[df['author'].str.contains(autor)].head(5)

Unnamed: 0,amazon_product_url,author,description,publisher,title,oid,bestsellers_date.numberLong,published_date.numberLong,rank.numberInt,rank_last_week.numberInt,weeks_on_list.numberInt,price.numberDouble
2,http://www.amazon.com/The-Front-Garano-Patrici...,Patricia Cornwell,A Massachusetts state investigator and his tea...,Putnam,THE FRONT,5b4aa4ead3089013507db18e,1211587200000,1212883200000,4,0,1,22.95
17,http://www.amazon.com/The-Front-Garano-Patrici...,Patricia Cornwell,A Massachusetts state investigator and his tea...,Putnam,THE FRONT,5b4aa4ead3089013507db1a5,1212192000000,1213488000000,7,4,2,22.95
36,http://www.amazon.com/The-Front-Garano-Patrici...,Patricia Cornwell,A Massachusetts state investigator and his tea...,Putnam,THE FRONT,5b4aa4ead3089013507db1c0,1212796800000,1214092800000,14,7,3,22.95
337,http://www.amazon.com/Scarpetta-Kay-Patricia-C...,Patricia Cornwell,The forensic pathologist Kay Scarpetta takes a...,Putnam,SCARPETTA,5b4aa4ead3089013507db3bb,1228521600000,1229817600000,1,0,1,27.95
348,http://www.amazon.com/Scarpetta-Kay-Patricia-C...,Patricia Cornwell,The forensic pathologist Kay Scarpetta takes a...,Putnam,SCARPETTA,5b4aa4ead3089013507db3cf,1229126400000,1230422400000,1,1,2,27.95
