# Assignment 2: Data cleaning and analysis
## Authors
This assignment was carried out by:
* Ignacio Such Ballester
* Andrés Isidro Fonts Santana

## 1. Description of the _dataset_
### 1.1 Context
We intend to bring a new board game to market that is as successful as possible and turn it into a bestseller.

To that end, we chose the [Board Game Data](https://www.kaggle.com/datasets/mrpantherson/board-game-data?select=bgg_db_2018_01.csv) _dataset_, available on the Kaggle platform.

This dataset was extracted through the API of the [BoardGameGeek](https://boardgamegeek.com/) site. It was generated in January 2018 and contains data on the first 5000 board games in the BoardGameGeek _ranking_.

With this dataset we can carry out an in-depth analysis, obtaining correlations, classifications and even predictions, to work out how to design our board game.

### 1.2 Description of the attributes
Each record in the _dataset_ (4999 once loaded, as shown in section 3) is described by 20 attributes:

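| Name        | Type    | Description | Example (Case Blue) |
|-------------|---------|-------------|---------------------|
| rank        | integer | Position in the BoardGameGeek ranking | 2210 |
| bgg_url     | string  | URL of the game's page on BoardGameGeek | https://boardgamegeek.com/boardgame/29285/case-blue |
| game_id     | integer | Identifier of the game on BoardGameGeek | 29285 |
| names       | string  | Name of the game | Case Blue |
| min_players | integer | Minimum number of players | 1 |
| max_players | integer | Maximum number of players | 2 |
| avg_time    | integer | Average playing time, in minutes | 22500 |
| min_time    | integer | Minimum playing time, in minutes | 0 |
| max_time    | integer | Maximum playing time, in minutes | 22500 |
| year        | integer | Year of publication | 2007 |
| avg_rating  | float   | Average user rating | 8.21402 |
| geek_rating | float   | Rating computed by BoardGameGeek (the "geek rating") | 5.96182 |
| num_votes   | integer | Number of user votes | 262 |
| image_url   | string  | URL of the game's image | https://cf.geekdo-images.com/images/pic206547.jpg |
| age         | integer | Minimum recommended age | 12 |
| mechanic    | string  | Comma-separated list of game mechanics | Dice Rolling, Hex-and-Counter, Simulation |
| owned       | integer | Number of users who own the game | 642 |
| category    | string  | Comma-separated list of categories | Wargame, World War II |
| designer    | string  | Designer or designers of the game | Dean Essig |
| weight      | float   | Complexity ("weight") rating of the game | 4.5821 |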


## 2. Data selection

Regression models can also be built to predict whether a game will be a bestseller based on its characteristics, together with hypothesis tests that help identify interesting properties in the samples.


## 3. Data cleaning

In [2]:
# Import the pandas library
import pandas as pd

In [4]:
# Load the dataset from the CSV file
bgg = pd.read_csv('../csv/bgg_db_2018_01.csv', sep=',', encoding='latin-1')

# Show 5 rows of the dataframe
bgg.head()

# Show number of rows in the dataframe
bgg.shape

(4999, 20)

In [594]:
# Count the missing (NA) values in each column
bgg.isnull().sum()

# The dataset contains no null values

rank           0
bgg_url        0
game_id        0
names          0
min_players    0
max_players    0
avg_time       0
min_time       0
max_time       0
year           0
avg_rating     0
geek_rating    0
num_votes      0
image_url      0
age            0
mechanic       0
owned          0
category       0
designer       0
weight         0
dtype: int64
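
Although no column contains nulls, the cells that follow show that zeros appear in columns such as `max_players`, where they most likely stand for missing information. A minimal sketch of how such zeros could be counted per column is given here. The `toy` frame and its values are made up for illustration and stand in for `bgg`.

```python
import pandas as pd

# Toy frame standing in for bgg; the values are made up for illustration.
toy = pd.DataFrame({
    "names": ["A", "B", "C", "D"],
    "min_players": [2, 0, 1, 2],
    "max_players": [4, 0, 0, 6],
    "avg_time": [60, 0, 30, 120],
})

# Keep only the numeric columns, then count the entries equal to 0 in each one.
numeric = toy.select_dtypes(include="number")
zero_counts = (numeric == 0).sum()
print(zero_counts)
```

Applied to the real data, the same two lines with `bgg` in place of `toy` would give one zero count per numeric column.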

In [595]:
# Show summary statistics rounded to 2 decimals
bgg.describe().round(2)

# Some games have extreme values in avg_time, min_time and max_time. max_players also contains extreme values, such as 0 at one end and 999 at the other.

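|       | rank | game_id | min_players | max_players | avg_time | min_time | max_time | year | avg_rating | geek_rating | num_votes | age | owned | weight |
|-------|------|---------|-------------|-------------|----------|----------|----------|------|------------|-------------|-----------|-----|-------|--------|
| count | 4999.0 | 4999.0 | 4999.0 | 4999.0 | 4999.0 | 4999.0 | 4999.0 | 4999.0 | 4999.0 | 4999.0 | 4999.0 | 4999.0 | 4999.0 | 4999.0 |
| mean  | 2500.0 | 84623.39 | 2.03 | 5.38 | 115.24 | 85.15 | 114.83 | 1997.74 | 6.96 | 6.08 | 1899.07 | 10.36 | 2881.6 | 2.35 |
| std   | 1443.23 | 74844.22 | 0.68 | 16.08 | 509.8 | 317.59 | 509.85 | 140.36 | 0.56 | 0.48 | 4516.59 | 3.28 | 6133.48 | 0.8 |
| min   | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -3000.0 | 5.77 | 5.64 | 55.0 | 0.0 | 46.0 | 0.0 |
| 25%   | 1250.5 | 10304.5 | 2.0 | 4.0 | 30.0 | 30.0 | 30.0 | 2003.0 | 6.53 | 5.72 | 267.0 | 8.0 | 588.0 | 1.74 |
| 50%   | 2500.0 | 66116.0 | 2.0 | 4.0 | 60.0 | 45.0 | 60.0 | 2010.0 | 6.9 | 5.9 | 581.0 | 11.0 | 1123.0 | 2.29 |
| 75%   | 3749.5 | 155214.0 | 2.0 | 6.0 | 120.0 | 90.0 | 120.0 | 2014.0 | 7.33 | 6.29 | 1526.5 | 12.0 | 2570.0 | 2.88 |
| max   | 4999.0 | 237087.0 | 8.0 | 999.0 | 22500.0 | 17280.0 | 22500.0 | 2018.0 | 9.26 | 8.52 | 74261.0 | 42.0 | 106608.0 | 4.9 |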
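
Extreme values like these can also be flagged programmatically rather than by reading the summary table. The sketch below uses the common interquartile-range rule, which is not the method used in this analysis. The helper `iqr_outliers`, its factor `k` and the sample values are assumptions made for illustration.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up playing times in minutes; 22500 mimics the maximum of avg_time.
times = pd.Series([30, 45, 60, 60, 90, 120, 22500], name="avg_time")
outlier_mask = iqr_outliers(times)
print(times[outlier_mask])
```

On the real data the call would be `iqr_outliers(bgg['avg_time'])`. Because the rule relies on quartiles, a single value such as 22500 does not distort the thresholds the way it distorts the mean.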


In [596]:
# Show the row with the maximum value of avg_time
bgg[bgg['avg_time']==bgg['avg_time'].max()]

Unnamed: 0,rank,bgg_url,game_id,names,min_players,max_players,avg_time,min_time,max_time,year,avg_rating,geek_rating,num_votes,image_url,age,mechanic,owned,category,designer,weight
2209,2210,https://boardgamegeek.com/boardgame/29285/case-blue,29285,Case Blue,1,2,22500,0,22500,2007,8.21402,5.96182,262,https://cf.geekdo-images.com/images/pic206547.jpg,12,"Dice Rolling, Hex-and-Counter, Simulation",642,"Wargame, World War II",Dean Essig,4.5821


In [597]:
# Show rows where max_players equals 0
bgg[bgg['max_players']==0]

Unnamed: 0,rank,bgg_url,game_id,names,min_players,max_players,avg_time,min_time,max_time,year,avg_rating,geek_rating,num_votes,image_url,age,mechanic,owned,category,designer,weight
1766,1766,https://boardgamegeek.com/boardgame/37301/decktet,37301,Decktet,0,0,30,30,30,2008,7.50923,6.08737,430,https://cf.geekdo-images.com/images/pic353574.jpg,0,none,1344,"Card Game, Game System, Print & Play",P. D. Magnus,1.9655
2164,2165,https://boardgamegeek.com/boardgame/18291/unpublished-prototype,18291,Unpublished Prototype,0,0,0,0,0,0,6.9737,5.96984,577,https://cf.geekdo-images.com/images/pic116113.jpg,0,none,881,none,(Uncredited),2.4
2475,2476,https://boardgamegeek.com/boardgame/23953/outside-scope-bgg,23953,Outside the Scope of BGG,0,0,0,0,0,0,6.73655,5.90572,516,https://cf.geekdo-images.com/images/pic193671.jpg,0,none,2190,none,(Uncredited),1.6582
2523,2524,https://boardgamegeek.com/boardgame/21804/traditional-card-games,21804,Traditional Card Games,0,0,0,0,0,0,6.52588,5.89507,689,https://cf.geekdo-images.com/images/pic111209.jpg,0,none,1203,"Card Game, Game System",(Uncredited),2.0169
2760,2761,https://boardgamegeek.com/boardgame/85204/kings-war,85204,Kings of War,2,0,60,60,60,2010,7.80907,5.85244,214,https://cf.geekdo-images.com/images/pic2619704.jpg,0,"Dice Rolling, Variable Player Powers",448,"Book, Fantasy, Miniatures, Wargame",Alessio Cavatore,2.5
2971,2972,https://boardgamegeek.com/boardgame/621/25-words-or-less,621,25 Words or Less,4,0,60,60,60,1996,6.56378,5.81638,484,https://cf.geekdo-images.com/images/pic195792.jpg,13,"Auction/Bidding, Partnerships",673,"Party Game, Word Game",Bruce Sterten,1.5333
3027,3028,https://boardgamegeek.com/boardgame/37672/warhammer-40000-assault-black-reach,37672,"Warhammer 40,000: Assault On Black Reach",2,0,240,240,240,2008,6.9625,5.80767,285,https://cf.geekdo-images.com/images/pic358486.jpg,0,"Dice Rolling, Modular Board, Variable Player Powers",737,"Fighting, Miniatures, Science Fiction, Wargame",Alessio Cavatore,3.1739
3107,3108,https://boardgamegeek.com/boardgame/85652/dystopian-wars,85652,Dystopian Wars,2,0,240,120,240,2010,7.30986,5.79765,218,https://cf.geekdo-images.com/images/pic902315.jpg,12,"Dice Rolling, Point to Point Movement, Variable Player Powers",477,"Miniatures, Nautical, Post-Napoleonic, Science Fiction, Wargame, World War I","Neil Fawcett, James Flack, Julian Glover, Alain Padfield, Franco Sammarco, Derek Sinclair",3.1304
3510,3511,https://boardgamegeek.com/boardgame/195242/tanks-panther-vs-sherman,195242,Tanks: Panther vs Sherman,2,0,60,0,60,2016,7.40312,5.74489,208,https://cf.geekdo-images.com/images/pic2933710.jpg,0,"Action Point Allowance System, Dice Rolling, Variable Player Powers",567,"Collectible Components, Miniatures, Wargame, World War II","Andrew Haught, Chris Townley, Phil Yates",1.8462
3633,3634,https://boardgamegeek.com/boardgame/25738/big-taboo,25738,The Big Taboo,4,0,0,0,0,2006,6.36553,5.73292,438,https://cf.geekdo-images.com/images/pic392560.jpg,12,"Acting, Memory, Paper-and-Pencil, Partnerships",843,"Action / Dexterity, Memory, Party Game, Word Game",Brian Hersch,1.6071


In [598]:
# Show value table of max_players
bgg.max_players.value_counts()

4      1635
2      955 
5      902 
6      795 
8      233 
7      92  
1      77  
10     72  
3      64  
12     49  
0      25  
99     24  
9      19  
20     12  
16     10  
18     8   
15     6   
30     5   
24     2   
11     2   
21     2   
52     1   
33     1   
200    1   
50     1   
13     1   
34     1   
100    1   
75     1   
68     1   
999    1   
Name: max_players, dtype: int64

In [599]:
# Show rows where min_players equals 0
bgg[bgg['min_players']==0]

Unnamed: 0,rank,bgg_url,game_id,names,min_players,max_players,avg_time,min_time,max_time,year,avg_rating,geek_rating,num_votes,image_url,age,mechanic,owned,category,designer,weight
1766,1766,https://boardgamegeek.com/boardgame/37301/decktet,37301,Decktet,0,0,30,30,30,2008,7.50923,6.08737,430,https://cf.geekdo-images.com/images/pic353574.jpg,0,none,1344,"Card Game, Game System, Print & Play",P. D. Magnus,1.9655
2164,2165,https://boardgamegeek.com/boardgame/18291/unpublished-prototype,18291,Unpublished Prototype,0,0,0,0,0,0,6.9737,5.96984,577,https://cf.geekdo-images.com/images/pic116113.jpg,0,none,881,none,(Uncredited),2.4
2475,2476,https://boardgamegeek.com/boardgame/23953/outside-scope-bgg,23953,Outside the Scope of BGG,0,0,0,0,0,0,6.73655,5.90572,516,https://cf.geekdo-images.com/images/pic193671.jpg,0,none,2190,none,(Uncredited),1.6582
2523,2524,https://boardgamegeek.com/boardgame/21804/traditional-card-games,21804,Traditional Card Games,0,0,0,0,0,0,6.52588,5.89507,689,https://cf.geekdo-images.com/images/pic111209.jpg,0,none,1203,"Card Game, Game System",(Uncredited),2.0169
2792,2793,https://boardgamegeek.com/boardgame/99358/stonewall-jacksons-way-ii,99358,Stonewall Jackson's Way II,0,2,720,0,720,2013,8.46483,5.84644,145,https://cf.geekdo-images.com/images/pic1693847.png,0,"Dice Rolling, Hex-and-Counter",602,"American Civil War, Wargame","Joseph M. Balkoski, Ed Beach, Mike Belles, Chris Withers",3.7895
3639,3640,https://boardgamegeek.com/boardgame/5985/miscellaneous-game-accessory,5985,Miscellaneous Game Accessory,0,0,0,0,0,0,6.93448,5.7327,212,https://cf.geekdo-images.com/images/pic1017967.jpg,0,none,1056,none,(Uncredited),3.3333
4192,4193,https://boardgamegeek.com/boardgame/2860/piecepack,2860,Piecepack,0,0,10,10,10,2001,7.10008,5.684,146,https://cf.geekdo-images.com/images/pic119215.jpg,5,none,522,Game System,James Kyle,2.4
4553,4554,https://boardgamegeek.com/boardgame/62214/aspern-essling-1809,62214,Aspern-Essling 1809,0,2,240,0,240,2009,7.95573,5.66053,96,https://cf.geekdo-images.com/images/pic606539.jpg,0,"Chit-Pull System, Hex-and-Counter",306,"Napoleonic, Wargame",Frédéric Bey,2.7826
4557,4558,https://boardgamegeek.com/boardgame/10904/new-rules-classic-games,10904,New Rules for Classic Games,0,0,0,0,0,1992,7.49868,5.66021,88,https://cf.geekdo-images.com/images/pic1514261.jpg,10,none,196,"Abstract Strategy, Action / Dexterity, Book, Card Game, Deduction, Dice, Negotiation, Word Game",R. Wayne Schmittberger,2.0
4774,4775,https://boardgamegeek.com/boardgame/4292/sword-and-flame,4292,The Sword and the Flame,0,0,60,60,60,1979,7.47105,5.64826,95,https://cf.geekdo-images.com/images/pic31648.jpg,12,none,219,"Miniatures, Wargame",Larry V. Brom,2.2857


In [600]:
# Show value table of min_players
bgg.min_players.value_counts()

2    3413
1    799 
3    642 
4    108 
5    15  
0    11  
6    6   
8    5   
Name: min_players, dtype: int64
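
A related sanity check, which is not part of the original analysis, is whether `min_players` ever exceeds `max_players`. The sketch below assumes a small frame, `sample`, built from three rows shown in the outputs above and restricted to the relevant columns.

```python
import pandas as pd

# Three rows copied from the outputs above (only the relevant columns).
sample = pd.DataFrame({
    "names": ["Case Blue", "Kings of War", "25 Words or Less"],
    "min_players": [1, 2, 4],
    "max_players": [2, 0, 0],
})

# A row is inconsistent when the minimum number of players exceeds the maximum.
inconsistent = sample[sample["min_players"] > sample["max_players"]]
print(inconsistent["names"].tolist())
```

Rows caught by this check are the ones where `max_players` is 0 while `min_players` is not, which suggests the 0 is a placeholder and not a real value.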

Several variables contain extreme values, or values that make no sense for this dataset, such as an average playing time of 22500 minutes.
Next, we replace these extreme values, either by capping them at fixed limits or by substituting the mean of the same variable.

In [601]:
# MIN_PLAYERS
# Replace 0 with 1, the smallest valid number of players
bgg['min_players'] = bgg['min_players'].replace(0, 1)

# MAX_PLAYERS
# Cap values at 10 and replace 0 with 1
bgg['max_players'] = bgg['max_players'].clip(upper=10).replace(0, 1)

# MIN_TIME
# Cap values at 90, set values of 16 or less to 15, and map 42 to 40
bgg['min_time'] = bgg['min_time'].clip(upper=90)
bgg['min_time'] = bgg['min_time'].where(bgg['min_time'] > 16, 15)
bgg['min_time'] = bgg['min_time'].replace(42, 40)

# AVG_TIME
# Replace values that occur fewer than 5 times with the column mean, then replace 0 with 15
mask = bgg['avg_time'].map(bgg['avg_time'].value_counts()) < 5
bgg['avg_time'] = bgg['avg_time'].mask(mask, round(bgg['avg_time'].mean(), 2))
bgg['avg_time'] = bgg['avg_time'].replace(0, 15)

# MAX_TIME
# Set values of 300 or more to 90 and values of 10 or less to 10
bgg['max_time'] = bgg['max_time'].where(bgg['max_time'] < 300, 90)
bgg['max_time'] = bgg['max_time'].where(bgg['max_time'] > 10, 10)
# Replace values that occur fewer than 5 times with the column mean
mask = bgg['max_time'].map(bgg['max_time'].value_counts()) < 5
bgg['max_time'] = bgg['max_time'].mask(mask, round(bgg['max_time'].mean(), 2))

# YEAR
# Set years of 1950 or earlier to 1950
bgg['year'] = bgg['year'].clip(lower=1950)
bgg['year'].value_counts()


2015    425
2016    406
2014    351
2013    332
2017    313
2012    312
2011    260
2010    253
2009    224
2008    195
2007    162
2004    152
2005    150
2006    146
2003    125
2002    103
2001    86 
2000    79 
1999    76 
1950    60 
1998    56 
1997    48 
1995    48 
1992    46 
1994    43 
1996    41 
1991    36 
1993    35 
1986    32 
1990    30 
1979    29 
1989    28 
1985    26 
1987    25 
1981    24 
1988    24 
1983    23 
1977    21 
1980    20 
1982    17 
1978    17 
1984    15 
1974    13 
1976    13 
1973    12 
1975    10 
1962    7  
1972    7  
1967    6  
2018    5  
1971    5  
1969    5  
1964    4  
1970    4  
1959    3  
1960    3  
1963    2  
1966    1  
1965    1  
1968    1  
1956    1  
1951    1  
1955    1  
Name: year, dtype: int64
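
The two techniques used in the cleaning cell, replacing rare values with the mean and capping at a limit, can be tried in isolation on small series. This is a minimal sketch: the helper `replace_rare_with_mean`, its `min_count` parameter and the sample values are assumptions made for illustration.

```python
import pandas as pd


def replace_rare_with_mean(series: pd.Series, min_count: int) -> pd.Series:
    """Replace values occurring fewer than min_count times with the series mean."""
    # Map every element to the number of times its value occurs in the series.
    counts = series.map(series.value_counts())
    return series.mask(counts < min_count, round(series.mean(), 2))


# Made-up durations in minutes; 500 occurs only once and is treated as rare.
durations = pd.Series([30, 30, 30, 60, 60, 60, 500])
cleaned = replace_rare_with_mean(durations, min_count=2)

# Capping with clip: every year below 1950 becomes 1950.
years = pd.Series([-3000, 0, 1979, 2015]).clip(lower=1950)

print(cleaned.tolist())
print(years.tolist())
```

Note that the mean is computed before the rare values are removed, so a large outlier still pulls the replacement value upwards. Using the median instead would be one way to avoid that, at the cost of departing from the approach taken here.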

## 4. Data analysis