# **Universidad Galileo**

## **Data Science in Python**

### **César Luis Polanco, 20062088**
### **Assignment No. 5 - Vectors and NumPy**

## **Introduction to the final course project**

The project applies what was covered in class, supported by additional references where useful, to build simple **univariate linear regression models** of the form: $$y = f(x) = mx + b$$

Where:
- y = the dependent variable
- x = the independent variable
- m = slope of the line (model parameter)
- b = intercept (model parameter)
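
To make the roles of m and b concrete, the sketch below fits a line by ordinary least squares with plain NumPy. The data are made up to lie exactly on y = 2x + 1, and the helper `predict` is an illustrative name, not part of the project code.

```python
import numpy as np

# Made-up data lying exactly on y = 2x + 1 (not from the project dataset)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Closed-form least-squares estimates for a single feature:
# m = covariance(x, y) / variance(x), b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

def predict(x_new, m, b):
    """Evaluate the fitted line y = m*x + b (hypothetical helper)."""
    return m * x_new + b

print(m, b)
print(predict(10.0, m, b))
```

With noisy data the same two formulas return the slope and intercept that minimize the sum of squared residuals.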



In [12]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [13]:
url = "./proyecto_training_data.npy"
datos = np.load(url)
# The original index 1-5 evaluates to -4, so this prints the last four rows
print(datos[-4:])

[[2.10000e+05 6.00000e+00 2.07300e+03 7.00000e+00 1.97800e+03 8.50000e+01]
 [2.66500e+05 7.00000e+00 1.18800e+03 9.00000e+00 1.94100e+03 6.60000e+01]
 [1.42125e+05 5.00000e+00 1.07800e+03 5.00000e+00 1.95000e+03 6.80000e+01]
 [1.47500e+05 5.00000e+00 1.25600e+03 6.00000e+00 1.96500e+03 7.50000e+01]]
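
Python evaluates the arithmetic inside an index before slicing, so an expression such as `1-5:` is the same as `-4:` and selects the last four rows rather than rows 1 through 5. A small sketch on a toy array (the name `arr` is illustrative):

```python
import numpy as np

arr = np.arange(12).reshape(6, 2)   # toy array with 6 rows

last_four = arr[1 - 5:]             # 1 - 5 == -4, so this is arr[-4:]
same = arr[-4:]
first_five = arr[:5]                # what "the first five rows" would look like

print(np.array_equal(last_four, same))
print(last_four.shape, first_five.shape)
```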


### **Data description (verbatim text from the file)**

- **SalePrice** - the property's sale price in dollars. This is the target variable that you're trying to predict.
- **OverallQual**: Overall material and finish quality, rates the overall material and finish of the house
    - 10 -> Very Excellent
    - 9 -> Excellent
    - 8 -> Very Good
    - 7 -> Good
    - 6 -> Above Average
    - 5 -> Average
    - 4 -> Below Average
    - 3 -> Fair
    - 2 -> Poor
    - 1 -> Very Poor
- **1stFlrSF**: First Floor square feet
- **TotRmsAbvGrd**: Total rooms above grade (does not include bathrooms)
- **YearBuilt**: Original construction date
- **LotFrontage**: Linear feet of street connected to property

In [38]:
encabezados = ["PrecioVenta", "CalidadMaterial", "PiesCuadradosPisoUno", "TotalHabitaciones", "AñoConstruccion", "PiesLinealesDePropiedad"]
df = pd.DataFrame(datos, columns=encabezados)
df.head()

| | PrecioVenta | CalidadMaterial | PiesCuadradosPisoUno | TotalHabitaciones | AñoConstruccion | PiesLinealesDePropiedad |
|---|---|---|---|---|---|---|
| 0 | 208500.0 | 7.0 | 856.0 | 8.0 | 2003.0 | 65.0 |
| 1 | 181500.0 | 6.0 | 1262.0 | 6.0 | 1976.0 | 80.0 |
| 2 | 223500.0 | 7.0 | 920.0 | 6.0 | 2001.0 | 68.0 |
| 3 | 140000.0 | 7.0 | 961.0 | 7.0 | 1915.0 | 60.0 |
| 4 | 250000.0 | 8.0 | 1145.0 | 9.0 | 2000.0 | 84.0 |


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PrecioVenta              1460 non-null   float64
 1   CalidadMaterial          1460 non-null   float64
 2   PiesCuadradosPisoUno     1460 non-null   float64
 3   TotalHabitaciones        1460 non-null   float64
 4   AñoConstruccion          1460 non-null   float64
 5   PiesLinealesDePropiedad  1201 non-null   float64
dtypes: float64(6)
memory usage: 68.6 KB


## **Data verification**

In [19]:
df.shape

(1460, 6)

In [20]:
# Show the rows that contain at least one NaN value
df[df.isna().any(axis=1)]

| | PrecioVenta | CalidadMaterial | PiesCuadradosPisoUno | TotalHabitaciones | AñoConstruccion | PiesLinealesDePropiedad |
|---|---|---|---|---|---|---|
| 7 | 200000.0 | 7.0 | 1107.0 | 7.0 | 1973.0 | NaN |
| 12 | 144000.0 | 5.0 | 912.0 | 4.0 | 1962.0 | NaN |
| 14 | 157000.0 | 6.0 | 1253.0 | 5.0 | 1960.0 | NaN |
| 16 | 149000.0 | 6.0 | 1004.0 | 5.0 | 1970.0 | NaN |
| 24 | 154000.0 | 5.0 | 1060.0 | 6.0 | 1968.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 1429 | 182900.0 | 6.0 | 1440.0 | 7.0 | 1981.0 | NaN |
| 1431 | 143750.0 | 6.0 | 958.0 | 5.0 | 1976.0 | NaN |
| 1441 | 149300.0 | 6.0 | 848.0 | 3.0 | 2004.0 | NaN |
| 1443 | 121000.0 | 6.0 | 952.0 | 4.0 | 1916.0 | NaN |
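
The `info()` output above already hints at the problem: `PiesLinealesDePropiedad` has 1201 non-null values out of 1460, so 259 are missing. The sketch below shows, on a made-up frame, how to count missing values per column and how the row filter with `isna().any(axis=1)` works; `toy` and its columns are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing values in column "b"
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [10.0, np.nan, 30.0, np.nan]})

nan_per_column = toy.isna().sum()            # missing values per column
rows_with_nan = toy[toy.isna().any(axis=1)]  # rows with at least one NaN

print(nan_per_column)
print(rows_with_nan.index.tolist())
```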


In [21]:
# Replace NaN values with 0
df['PiesLinealesDePropiedad'] = df['PiesLinealesDePropiedad'].fillna(0)
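
Filling with 0 treats a missing frontage as a frontage of zero feet, which lowers the column's mean and minimum. The sketch below compares that choice with a median fill on made-up values. It is only an illustration of the effect, not a claim about which strategy suits this project.

```python
import numpy as np
import pandas as pd

# Made-up frontage values with two gaps
s = pd.Series([60.0, np.nan, 80.0, np.nan, 70.0])

filled_zero = s.fillna(0)             # what the notebook does
filled_median = s.fillna(s.median())  # one common alternative (median skips NaN)

print(filled_zero.mean(), filled_zero.min())
print(filled_median.mean(), filled_median.min())
```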

In [22]:
# Re-check the DataFrame after filling
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PrecioVenta              1460 non-null   float64
 1   CalidadMaterial          1460 non-null   float64
 2   PiesCuadradosPisoUno     1460 non-null   float64
 3   TotalHabitaciones        1460 non-null   float64
 4   AñoConstruccion          1460 non-null   float64
 5   PiesLinealesDePropiedad  1460 non-null   float64
dtypes: float64(6)
memory usage: 68.6 KB


In [23]:
# Cast the whole-number columns from float to int

df["CalidadMaterial"] = df["CalidadMaterial"].astype(int)
df["TotalHabitaciones"] = df["TotalHabitaciones"].astype(int)
df["AñoConstruccion"] = df["AñoConstruccion"].astype(int)

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PrecioVenta              1460 non-null   float64
 1   CalidadMaterial          1460 non-null   int32  
 2   PiesCuadradosPisoUno     1460 non-null   float64
 3   TotalHabitaciones        1460 non-null   int32  
 4   AñoConstruccion          1460 non-null   int32  
 5   PiesLinealesDePropiedad  1460 non-null   float64
dtypes: float64(3), int32(3)
memory usage: 51.5 KB
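
Two details of `astype(int)` are worth keeping in mind. The resulting width (`int32` above) can depend on the platform and NumPy version, and the cast truncates any fractional part instead of rounding. A minimal sketch on made-up values that checks for whole numbers first and requests an explicit width:

```python
import numpy as np
import pandas as pd

col = pd.Series([7.0, 6.0, 8.0])   # made-up quality scores stored as floats

# Confirm every value is a whole number before casting
is_whole = bool(np.all(np.mod(col, 1) == 0))

# Ask for an explicit width so the dtype is the same on every platform
as_int = col.astype("int64")

# The cast truncates toward zero; it does not round
truncated = pd.Series([1.9, -1.9]).astype("int64")

print(is_whole, as_int.dtype, truncated.tolist())
```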


In [26]:
# Initial look at the data
df

| | PrecioVenta | CalidadMaterial | PiesCuadradosPisoUno | TotalHabitaciones | AñoConstruccion | PiesLinealesDePropiedad |
|---|---|---|---|---|---|---|
| 0 | 208500.0 | 7 | 856.0 | 8 | 2003 | 65.0 |
| 1 | 181500.0 | 6 | 1262.0 | 6 | 1976 | 80.0 |
| 2 | 223500.0 | 7 | 920.0 | 6 | 2001 | 68.0 |
| 3 | 140000.0 | 7 | 961.0 | 7 | 1915 | 60.0 |
| 4 | 250000.0 | 8 | 1145.0 | 9 | 2000 | 84.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 1455 | 175000.0 | 6 | 953.0 | 7 | 1999 | 62.0 |
| 1456 | 210000.0 | 6 | 2073.0 | 7 | 1978 | 85.0 |
| 1457 | 266500.0 | 7 | 1188.0 | 9 | 1941 | 66.0 |
| 1458 | 142125.0 | 5 | 1078.0 | 5 | 1950 | 68.0 |


In [30]:
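# Train/test split by label with .loc (label slices include the end label)
rows, cols = df.shape
df_train = df.loc[:int(rows*0.8),]     # labels 0..1168 -> 1169 rows
df_test = df.loc[int(rows*0.8)+1:,]    # labels 1169..1459 -> 291 rows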

In [33]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1169 entries, 0 to 1168
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PrecioVenta              1169 non-null   float64
 1   CalidadMaterial          1169 non-null   int32  
 2   PiesCuadradosPisoUno     1169 non-null   float64
 3   TotalHabitaciones        1169 non-null   int32  
 4   AñoConstruccion          1169 non-null   int32  
 5   PiesLinealesDePropiedad  1169 non-null   float64
dtypes: float64(3), int32(3)
memory usage: 41.2 KB


In [37]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 291 entries, 1169 to 1459
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PrecioVenta              291 non-null    float64
 1   CalidadMaterial          291 non-null    int32  
 2   PiesCuadradosPisoUno     291 non-null    float64
 3   TotalHabitaciones        291 non-null    int32  
 4   AñoConstruccion          291 non-null    int32  
 5   PiesLinealesDePropiedad  291 non-null    float64
dtypes: float64(3), int32(3)
memory usage: 10.4 KB
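
The 1169/291 split comes from `.loc` slicing by label and including the end label: `int(1460*0.8)` is 1168, so labels 0 through 1168 give 1169 rows, one more than 80%. The sketch below contrasts `.loc` with position-based `.iloc` on a made-up ten-row frame; `toy` and `cut` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})   # made-up frame with labels 0..9
cut = int(len(toy) * 0.8)              # 8

# Label-based slicing: the end label is included
train_loc = toy.loc[:cut]              # labels 0..8 -> 9 rows
test_loc = toy.loc[cut + 1:]           # label 9 -> 1 row

# Position-based slicing: the end position is excluded
train_iloc = toy.iloc[:cut]            # positions 0..7 -> 8 rows
test_iloc = toy.iloc[cut:]             # positions 8..9 -> 2 rows

print(len(train_loc), len(test_loc))
print(len(train_iloc), len(test_iloc))
```

Both variants partition the data without overlap; they differ only in which side receives the boundary row.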


## **Data exploration**

In [42]:
# Summary statistics for df_train
round(df_train.describe(),4)

| | PrecioVenta | CalidadMaterial | PiesCuadradosPisoUno | TotalHabitaciones | AñoConstruccion | PiesLinealesDePropiedad |
|---|---|---|---|---|---|---|
| count | 1169.0 | 1169.0 | 1169.0 | 1169.0 | 1169.0 | 1169.0 |
| mean | 180636.8212 | 6.1009 | 1156.3918 | 6.4859 | 1971.42 | 57.6638 |
| std | 78798.0219 | 1.3774 | 373.6276 | 1.6085 | 29.9579 | 34.1698 |
| min | 34900.0 | 1.0 | 334.0 | 2.0 | 1875.0 | 0.0 |
| 25% | 129900.0 | 5.0 | 882.0 | 5.0 | 1954.0 | 43.0 |
| 50% | 163000.0 | 6.0 | 1086.0 | 6.0 | 1973.0 | 64.0 |
| 75% | 214000.0 | 7.0 | 1390.0 | 7.0 | 2000.0 | 79.0 |
| max | 755000.0 | 10.0 | 3228.0 | 14.0 | 2010.0 | 313.0 |


In [57]:
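# np.ptp returns a plain NumPy array, and appending it to the describe() output
# raised "TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>';
# only Series and DataFrame objs are valid". The range (max - min) is added as a labelled row instead.
resumen = df_train.describe()
resumen.loc["range"] = resumen.loc["max"] - resumen.loc["min"]
round(resumen, 4)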
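
`np.ptp` (peak to peak) returns a bare NumPy array with no column labels, so it has to be wrapped in a labelled `Series` before it can be combined with a `describe()` table. A minimal sketch on made-up data; `toy`, `rango` and `summary_with_range` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the training data
toy = pd.DataFrame({"a": [1.0, 4.0, 9.0], "b": [10.0, 20.0, 50.0]})
summary = toy.describe()

# Wrap the per-column range in a Series that carries the column names
rango = pd.Series(np.ptp(toy.to_numpy(), axis=0), index=toy.columns, name="range")

# Turn the Series into a one-row DataFrame and stack it under the summary
summary_with_range = pd.concat([summary, rango.to_frame().T])

print(summary_with_range.round(4))
```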