# Reading and Initial Inspection of a Dataset of NBA Basketball Players Since 1950

### Import libraries

In [1]:
import pandas as pd

### Reading the dataset

In [2]:
bd = pd.read_csv("player_data.csv",delimiter=',',low_memory=False)
bd

,name,year_start,year_end,position,height,weight,birth_date,college
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240.0,"June 24, 1968",Duke University
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235.0,"April 7, 1946",Iowa State University
2,Kareem Abdul-Jabbar,1970,1989,C,7-2,225.0,"April 16, 1947","University of California, Los Angeles"
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162.0,"March 9, 1969",Louisiana State University
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223.0,"November 3, 1974",San Jose State University
...,...,...,...,...,...,...,...,...
4545,Ante Zizic,2018,2018,F-C,6-11,250.0,"January 4, 1997",
4546,Jim Zoet,1983,1983,C,7-1,240.0,"December 20, 1953",Kent State University
4547,Bill Zopf,1971,1971,G,6-1,170.0,"June 7, 1948",Duquesne University
4548,Ivica Zubac,2017,2018,C,7-1,265.0,"March 18, 1997",


## Initial inspection
Height is given in feet and inches (for example, `6-10` means 6 ft 10 in) and weight is given in pounds.

Inspecting the variables

In [3]:
bd.dtypes

name           object
year_start      int64
year_end        int64
position       object
height         object
weight        float64
birth_date     object
college        object
dtype: object

Information about the most relevant variables

http://stat-computing.org/dataexpo/2009/the-data.html


### Selecting the variables of interest

In [4]:
bd=bd[['name','year_start','year_end','position','height','weight']]
bd

,name,year_start,year_end,position,height,weight
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240.0
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235.0
2,Kareem Abdul-Jabbar,1970,1989,C,7-2,225.0
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162.0
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223.0
...,...,...,...,...,...,...
4545,Ante Zizic,2018,2018,F-C,6-11,250.0
4546,Jim Zoet,1983,1983,C,7-1,240.0
4547,Bill Zopf,1971,1971,G,6-1,170.0
4548,Ivica Zubac,2017,2018,C,7-1,265.0


### Dataset contents: Variables

List the names of the variables, or *features*.

In [5]:
bd.columns

Index(['name', 'year_start', 'year_end', 'position', 'height', 'weight'], dtype='object')

List the variables and their data types.

In [6]:
bd.dtypes

name           object
year_start      int64
year_end        int64
position       object
height         object
weight        float64
dtype: object

### Dataset contents: Records

List the first records of the dataset.

In [7]:
bd.head(10)

,name,year_start,year_end,position,height,weight
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240.0
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235.0
2,Kareem Abdul-Jabbar,1970,1989,C,7-2,225.0
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162.0
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223.0
5,Shareef Abdur-Rahim,1997,2008,F,6-9,225.0
6,Tom Abernethy,1977,1981,F,6-7,220.0
7,Forest Able,1957,1957,G,6-3,180.0
8,John Abramovic,1947,1948,F,6-3,195.0
9,Alex Abrines,2017,2018,G-F,6-6,190.0


### Data management functions

Count the number of non-null records in each column.

In [8]:
bd.count()

name          4550
year_start    4550
year_end      4550
position      4549
height        4549
weight        4544
dtype: int64
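In the output above, `name` has 4550 values while `weight` has 4544, which implies 6 missing weights. The sketch below shows one way to report missing values directly, using `isna().sum()` next to `count()`. The `sample` frame and its player names are made up for illustration and are not taken from `player_data.csv`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not from player_data.csv).
sample = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C", "Player D"],
    "position": ["G", None, "C", "F"],
    "weight": [185.0, 210.0, np.nan, np.nan],
})

non_null = sample.count()      # non-missing values per column
missing = sample.isna().sum()  # missing values per column

# For every column, the two counts add up to the number of rows.
summary = pd.DataFrame({"non_null": non_null, "missing": missing})
print(summary)
```

Reading both columns side by side makes it easier to see which variables would lose rows if incomplete records were dropped.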

Compute the minimum number of non-null records across columns.

In [9]:
bd.count().min()

4544

Count the non-null records of the **weight** variable.

In [10]:
bd['weight'].count()

4544

Select the records with a weight above 200 pounds and count the records per column.

In [11]:
bd[bd['weight'] > 200].count()

name          2582
year_start    2582
year_end      2582
position      2582
height        2582
weight        2582
dtype: int64
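A detail worth keeping in mind with this kind of filter is that a comparison against a missing value evaluates to `False`, so players without a recorded weight never pass the condition. The following minimal sketch illustrates this on a made-up `sample` frame (the names and weights are hypothetical).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not from player_data.csv).
sample = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C", "Player D"],
    "weight": [185.0, 240.0, np.nan, 225.0],
})

mask = sample["weight"] > 200  # NaN > 200 evaluates to False
heavy = sample[mask]           # rows that satisfy the condition
n_heavy = int(mask.sum())      # True counts as 1, so this counts matching rows

print(heavy)
print(n_heavy)
```

Summing the boolean mask is a compact alternative to filtering and then calling `count()` when only the number of matching rows is needed.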

Compute the number of distinct positions (missing values are not counted).

In [12]:
bd['position'].drop_duplicates().count()

7
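The same count can also be obtained with `Series.nunique()`, which ignores missing values by default. The sketch below compares both approaches on a small made-up series of positions.

```python
import pandas as pd

# Toy data (hypothetical); one entry is missing on purpose.
positions = pd.Series(["G", "F", "C", "G", None, "F-C", "F"], name="position")

n_drop = positions.drop_duplicates().count()    # count() skips the missing entry
n_unique = positions.nunique()                  # missing values ignored by default
n_with_nan = positions.nunique(dropna=False)    # treats "missing" as one more value

print(n_drop, n_unique, n_with_nan)
```

Passing `dropna=False` is a quick way to check whether a column contains any missing values at all.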

Compute the mean weight of all players.

In [14]:
bd['weight'].mean()

208.9080105633803

Compute the mean of each numeric variable by position.

In [15]:
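# numeric_only=True skips text columns such as name and height,
# which recent pandas versions no longer drop automatically.
bd.groupby('position').mean(numeric_only=True).head()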

position,year_start,year_end,weight
C,1988.452191,1992.577689,244.552
C-F,1978.43379,1984.552511,229.0
F,1986.827907,1990.018605,218.139643
F-C,1980.458763,1986.770619,223.93299
F-G,1975.75463,1981.805556,203.087963


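Describe the summary statistics of the **height** variable. Because `height` is stored as text (`object` dtype), `describe()` reports the count, the number of unique values, the most frequent value, and its frequency rather than numeric statistics.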

In [17]:
bd['height'].describe()

count     4549
unique      28
top        6-7
freq       473
Name: height, dtype: object
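To obtain numeric statistics such as the mean or quartiles, the `feet-inches` strings would first need to be converted to a number. The following is a minimal sketch of one possible conversion to total inches. The helper `to_inches` and the series `heights` are hypothetical names introduced here, and the sketch assumes every non-missing value follows the `feet-inches` pattern seen in the table.

```python
import pandas as pd

# A few heights in the "feet-inches" format, plus one missing value.
heights = pd.Series(["6-10", "6-9", "7-2", "6-1", None], name="height")

def to_inches(h):
    """Convert a 'feet-inches' string to total inches; keep missing values as NaN."""
    if pd.isna(h):
        return float("nan")
    feet, inches = h.split("-")
    return int(feet) * 12 + int(inches)

height_in = heights.map(to_inches)

# With a numeric series, describe() now reports mean, std, min, quartiles, and max.
print(height_in.describe())
```

Multiplying the result by 2.54 would give the height in centimeters.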

Total number of players grouped by position, sorted in ascending order of count.

In [21]:
bd_dest_size=bd.groupby('position').size().reset_index(name='n').sort_values(by='n', ascending=True).reset_index(drop=True)
bd_dest_size

,position,n
0,F-G,216
1,C-F,219
2,G-F,360
3,F-C,388
4,C,502
5,F,1290
6,G,1574
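A shorter route to the same kind of table is `Series.value_counts()`. The sketch below builds the counts both ways on a made-up `sample` frame so the two results can be compared. It is an alternative for illustration, not the approach used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical, not from player_data.csv).
sample = pd.DataFrame({"position": ["G", "F", "G", "C", "F", "G"]})

# Approach used above: group, take the size, then sort.
by_groupby = (
    sample.groupby("position").size()
    .reset_index(name="n")
    .sort_values(by="n", ascending=True)
    .reset_index(drop=True)
)

# Alternative: value_counts() counts and sorts in one step.
by_value_counts = (
    sample["position"].value_counts(ascending=True)
    .rename_axis("position")
    .reset_index(name="n")
)

print(by_groupby)
print(by_value_counts)
```

Both approaches skip rows where `position` is missing, so the totals agree.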


Total number of players grouped by position and height, sorted in ascending order of count.

In [23]:
bd_dest_ori=bd.groupby(['position','height']).size().reset_index(name='n').sort_values(by='n', ascending=True).reset_index(drop=True)
bd_dest_ori

,position,height,n
0,G-F,6-10,1
1,C,6-4,1
2,G,6-9,1
3,F-G,6-11,1
4,F-G,5-11,1
...,...,...,...
92,F,6-9,242
93,G,6-2,251
94,F,6-7,277
95,F,6-8,287


Group by position and compute the mean weight per position, sorted in descending order.

In [28]:
bd_mean=bd.groupby(['position'])[['weight']].mean().rename(columns={'weight': 'mWeight'})\
.sort_values(by='mWeight', ascending=False).reset_index()
bd_mean

,position,mWeight
0,C,244.552
1,C-F,229.0
2,F-C,223.93299
3,F,218.139643
4,F-G,203.087963
5,G-F,197.611111
6,G,186.880407


In [29]:
bd_mean_size=pd.merge(bd_mean, bd_dest_size, on='position')
bd_mean_size

,position,mWeight,n
0,C,244.552,502
1,C-F,229.0,219
2,F-C,223.93299,388
3,F,218.139643,1290
4,F-G,203.087963,216
5,G-F,197.611111,360
6,G,186.880407,1574
