# Pandas - Exercise

## Introduction

### Context:

#### This dataset is a record of each building or building unit (apartment, etc.) sold on the New York real estate market during a 12-month period.

### Contents:

#### This dataset contains the location, address, type, sale price and sale date of each building unit sold. Here are some references about the fields:

* BOROUGH: A code to define the borough in which the property is located:
    - Manhattan (1),
    - Bronx (2),
    - Brooklyn (3),
    - Queens (4),
    - Staten Island (5).

* BLOCK; LOT: The combination of borough, block, and lot forms a unique key to the property in New York City, called the BBL.

* BUILDING CLASS AT PRESENT and BUILDING CLASS AT TIME OF SALE: The type of building at various points in time. See the glossary below:
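
To make the field descriptions concrete, here is a minimal sketch that maps the borough codes to names and builds a BBL string. The sample rows are hypothetical, and the 10-digit layout (1 digit for the borough, 5 for the block, 4 for the lot) is the conventional BBL format, treated here as an assumption rather than something stated in the dataset.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the dataset.
sample = pd.DataFrame({
    "BOROUGH": [1, 3],
    "BLOCK": [392, 1234],
    "LOT": [6, 21],
})

# Borough codes as listed in the field reference.
BOROUGH_NAMES = {
    1: "Manhattan",
    2: "Bronx",
    3: "Brooklyn",
    4: "Queens",
    5: "Staten Island",
}
sample["BOROUGH NAME"] = sample["BOROUGH"].map(BOROUGH_NAMES)

# Assumed layout: 1-digit borough + 5-digit block + 4-digit lot.
sample["BBL"] = (
    sample["BOROUGH"].astype(str)
    + sample["BLOCK"].astype(str).str.zfill(5)
    + sample["LOT"].astype(str).str.zfill(4)
)

print(sample)
```

`BOROUGH NAME` and `BBL` are illustrative column names, not columns of the original file.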

#### For additional reference on individual fields, see the [Glossary of Terms](https://www1.nyc.gov/assets/finance/downloads/pdf/07pdf/glossary_rsf071607.pdf). For building classification codes, see the Glossary of Building Classifications [NYC Property Sales](https://www.kaggle.com/new-york-city/nyc-property-sales).


## Let's import the necessary packages and load the data.

In [1]:
import numpy as np
import pandas as pd
from scipy import stats, integrate
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks")

In [3]:
# Reading the CSV 
nyc = pd.read_csv('nyc-rolling-sales_twentieth.csv')

### Exercise 1: Evaluate the column [types](https://realpython.com/python-data-types/#type-conversion) and make the necessary changes.

In [6]:
nyc.dtypes

Unnamed: 0                         int64
BOROUGH                            int64
NEIGHBORHOOD                      object
BUILDING CLASS CATEGORY           object
TAX CLASS AT PRESENT              object
BLOCK                              int64
LOT                                int64
EASE-MENT                         object
BUILDING CLASS AT PRESENT         object
ADDRESS                           object
APARTMENT NUMBER                  object
ZIP CODE                           int64
RESIDENTIAL UNITS                  int64
COMMERCIAL UNITS                   int64
TOTAL UNITS                        int64
LAND SQUARE FEET                  object
GROSS SQUARE FEET                 object
YEAR BUILT                         int64
TAX CLASS AT TIME OF SALE          int64
BUILDING CLASS AT TIME OF SALE    object
SALE PRICE                        object
SALE DATE                         object
dtype: object

Observation:

Columns that should be numeric, such as LAND SQUARE FEET, GROSS SQUARE FEET and SALE PRICE, are stored as `object`. To work with these columns, we need to convert them to a numeric type.
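
Before converting, it can help to see which values prevent a column from being numeric. The sketch below uses a hypothetical Series; the placeholder string `" -  "` is an assumption standing in for whatever non-numeric marker the raw file actually contains.

```python
import pandas as pd

# Hypothetical values; " -  " is an assumed placeholder for missing data.
raw = pd.Series(["1633", "2272", " -  ", "1750"], name="LAND SQUARE FEET")

# Try the conversion without touching the original column.
converted = pd.to_numeric(raw, errors="coerce")

# Values that were present in the raw data but could not be parsed.
bad_values = raw[converted.isna() & raw.notna()]

print(bad_values.value_counts())
```

Running the same check on `nyc['SALE PRICE']` would show how many rows are affected before any are discarded.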

### Let's change the types of some columns.

In [7]:
# Convert the columns to a numeric type
## The errors='coerce' parameter forces the conversion: any value
## that cannot be parsed is returned as NaN

nyc['LAND SQUARE FEET'] = pd.to_numeric(nyc['LAND SQUARE FEET'], errors='coerce')
nyc['GROSS SQUARE FEET'] = pd.to_numeric(nyc['GROSS SQUARE FEET'], errors='coerce')
nyc['SALE PRICE'] = pd.to_numeric(nyc['SALE PRICE'], errors='coerce')
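
The `dtypes` listing also shows SALE DATE as `object`. A minimal sketch of the analogous conversion with `pd.to_datetime`, using hypothetical values (the last one is deliberately invalid to show how `errors='coerce'` behaves):

```python
import pandas as pd

# Hypothetical values in the same format as the SALE DATE column.
sale_dates = pd.Series([
    "2017-07-19 00:00:00",
    "2016-09-23 00:00:00",
    "not a date",
])

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(sale_dates, errors="coerce")

print(parsed)
print(parsed.dt.year)
```

Whether the notebook needs this conversion depends on the later exercises; none of the steps shown here use SALE DATE.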

### Let's eliminate the rows that contain `NaN` values

In [9]:
# Discard rows with null values
nyc.dropna(inplace=True)
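
Note that `dropna()` without arguments discards a row if any column is null, including columns that are not used in the analysis. A possible alternative, sketched on hypothetical toy data, is to restrict the check with `subset`:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing price, one missing apartment number.
toy = pd.DataFrame({
    "SALE PRICE": [100.0, np.nan, 300.0],
    "LAND SQUARE FEET": [10.0, 20.0, 30.0],
    "APARTMENT NUMBER": [np.nan, "2B", "3C"],  # not needed for price per area
})

# Drops every row with a null in any column.
all_cols = toy.dropna()

# Drops only rows missing the columns the analysis depends on.
needed = toy.dropna(subset=["SALE PRICE", "LAND SQUARE FEET"])

print(len(all_cols), len(needed))
```

Which variant is appropriate depends on how many rows the stricter version removes from the real data.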


### What is the average value per square meter in NY?

In [11]:
# Create columns with the values converted from square feet to square meters
nyc['LAND SQUARE METER'] = nyc['LAND SQUARE FEET'] * 0.092903
nyc['GROSS SQUARE METER'] = nyc['GROSS SQUARE FEET'] * 0.092903
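
The factor 0.092903 used above follows from the definition of the foot as exactly 0.3048 m. A short sketch of the derivation, with `sqft_to_sqm` as a hypothetical helper name:

```python
# One international foot is exactly 0.3048 meters by definition.
FEET_TO_METERS = 0.3048

# Areas scale with the square of the length factor.
SQFT_TO_SQM = FEET_TO_METERS ** 2

print(round(SQFT_TO_SQM, 6))


def sqft_to_sqm(area_sqft):
    """Hypothetical helper: convert an area from square feet to square meters."""
    return area_sqft * SQFT_TO_SQM


print(sqft_to_sqm(1633.0))
```

The rounded constant and the unrounded one differ only from the seventh decimal place onward, so the results agree closely with the values in the notebook.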

In [12]:
# Calculate the price per m2 of each apartment
nyc['PRICE M2'] = nyc['SALE PRICE'] / nyc['LAND SQUARE METER']

# View the first lines of the dataframe
nyc.head()

|  | Unnamed: 0 | BOROUGH | NEIGHBORHOOD | BUILDING CLASS CATEGORY | TAX CLASS AT PRESENT | BLOCK | LOT | EASE-MENT | BUILDING CLASS AT PRESENT | ADDRESS | ... | LAND SQUARE FEET | GROSS SQUARE FEET | YEAR BUILT | TAX CLASS AT TIME OF SALE | BUILDING CLASS AT TIME OF SALE | SALE PRICE | SALE DATE | LAND SQUARE METER | GROSS SQUARE METER | PRICE M2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 392 | 6 |  | C2 | 153 AVENUE B | ... | 1633.0 | 6440.0 | 1900 | 2 | C2 | 6625000.0 | 2017-07-19 00:00:00 | 151.710599 | 598.29532 | 43668.669451 |
| 3 | 7 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2B | 402 | 21 |  | C4 | 154 EAST 7TH STREET | ... | 2272.0 | 6794.0 | 1913 | 2 | C4 | 3936272.0 | 2016-09-23 00:00:00 | 211.075616 | 631.182982 | 18648.634431 |
| 4 | 8 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 404 | 55 |  | C2 | 301 EAST 10TH STREET | ... | 2369.0 | 4615.0 | 1900 | 2 | C2 | 8000000.0 | 2016-11-17 00:00:00 | 220.087207 | 428.747345 | 36349.227695 |
| 6 | 10 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2B | 406 | 32 |  | C4 | 210 AVENUE B | ... | 1750.0 | 4226.0 | 1920 | 2 | C4 | 3192840.0 | 2016-09-23 00:00:00 | 162.58025 | 392.608078 | 19638.547733 |
| 9 | 13 | 1 | ALPHABET CITY | 08 RENTALS - ELEVATOR APARTMENTS | 2 | 387 | 153 |  | D9 | 629 EAST 5TH STREET | ... | 4489.0 | 18523.0 | 1920 | 2 | D9 | 16232000.0 | 2016-11-07 00:00:00 | 417.041567 | 1720.842269 | 38921.779708 |


In [13]:
# Calculate the average price per square meter
nyc['PRICE M2'].mean()

38029.45383380301
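
This figure is the mean of the per-sale ratios, so every sale counts equally regardless of its size, and a few extreme ratios can move it considerably. The sketch below, on hypothetical toy data, compares it with two alternatives: the pooled ratio (total price over total area) and the median.

```python
import pandas as pd

# Hypothetical sales; the second one has a nominal price of 10.
toy = pd.DataFrame({
    "SALE PRICE": [1_000_000.0, 10.0, 3_000_000.0],
    "LAND SQUARE METER": [100.0, 100.0, 200.0],
})
toy["PRICE M2"] = toy["SALE PRICE"] / toy["LAND SQUARE METER"]

# Mean of the individual ratios (what the notebook computes).
mean_of_ratios = toy["PRICE M2"].mean()

# Total price divided by total area: larger lots weigh more.
pooled = toy["SALE PRICE"].sum() / toy["LAND SQUARE METER"].sum()

# Median of the ratios: less sensitive to extreme values.
median_ratio = toy["PRICE M2"].median()

print(mean_of_ratios, pooled, median_ratio)
```

None of the three is the single correct answer; they answer slightly different questions about the same data.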

### What is the average price per square meter of each `BLOCK`? Arrange the data to indicate which is the most expensive.
Tip: do the calculation with groupby

In [14]:
m2_per_block = nyc.groupby('BLOCK')[['PRICE M2']].mean()
m2_per_block.head()

| BLOCK | PRICE M2 |
|---|---|
| 10 | 157674.187418 |
| 27 | 153969.473449 |
| 29 | 164119.372653 |
| 38 | 1282.30988 |
| 40 | 181858.290247 |


In [15]:
m2_per_block.sort_values(by='PRICE M2', ascending=False)

| BLOCK | PRICE M2 |
|---|---|
| 175 | 578463.342097 |
| 1369 | 449804.030322 |
| 52 | 358289.131786 |
| 1301 | 292468.922287 |
| 1548 | 252843.642142 |
| ... | ... |
| 1000 | 0.040621 |
| 1844 | 0.004629 |
| 1773 | 0.004266 |
| 599 | 0.001558 |
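
When only the top of the ranking is needed, `idxmax` and `nlargest` avoid sorting the whole result. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical sales in three blocks.
toy = pd.DataFrame({
    "BLOCK": [10, 10, 27, 40],
    "PRICE M2": [100.0, 300.0, 150.0, 500.0],
})

# Selecting the column with a string returns a Series indexed by BLOCK.
m2_per_block_toy = toy.groupby("BLOCK")["PRICE M2"].mean()

# Label of the block with the highest average price.
top_block = m2_per_block_toy.idxmax()

# The two most expensive blocks, already in descending order.
top_two = m2_per_block_toy.nlargest(2)

print(top_block)
print(top_two)
```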


### In which `BLOCK` is there the greatest dispersion of prices per square meter? Arrange the values to identify the largest.

(Remember the coefficient of variation formula to measure dispersion)
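
For reference, the coefficient of variation of a sample with mean $\bar{x}$ and standard deviation $s$ is written below. The $n-1$ denominator matches the default of the pandas `.std()` method.

```latex
CV = \frac{s}{\bar{x}},
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```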

**Clues**
* The first option is to define a function with arrays and use `.apply()`

In [16]:
def calcular_CV(coluna):
    CV  = coluna.std() / coluna.mean()
    return CV

* The second is to generate two series:
  - one with the `.std()` method and divide it by another series generated with `mean()`

In [17]:
nyc.groupby('BLOCK')[['PRICE M2']].apply(calcular_CV).sort_values(by='PRICE M2', ascending=False)

| BLOCK | PRICE M2 |
|---|---|
| 772 | 1.999991 |
| 2064 | 1.899623 |
| 1041 | 1.732041 |
| 1468 | 1.414213 |
| 1553 | 1.414213 |
| ... | ... |
| 2226 | NaN |
| 2234 | NaN |
| 2238 | NaN |
| 2242 | NaN |


In [18]:
std_preco = nyc.groupby('BLOCK')[['PRICE M2']].std()

In [19]:
preco_medio = nyc.groupby('BLOCK')[['PRICE M2']].mean()

In [20]:
CV_block = std_preco/preco_medio

In [21]:
CV_block.sort_values(by='PRICE M2', ascending=False)

| BLOCK | PRICE M2 |
|---|---|
| 772 | 1.999991 |
| 2064 | 1.899623 |
| 1041 | 1.732041 |
| 1468 | 1.414213 |
| 1553 | 1.414213 |
| ... | ... |
| 2226 | NaN |
| 2234 | NaN |
| 2238 | NaN |
| 2242 | NaN |
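
Both approaches return the same ranking. The sketch below, on hypothetical toy data, reproduces two features of the output: a block with a single sale has an undefined sample standard deviation, so its CV is missing, and a block with two sales where one price is close to zero reaches a CV of about 1.414 (the square root of 2, the largest value two non-negative observations can produce). This suggests, without proving it, that the top-ranked blocks contain near-zero sale prices.

```python
import numpy as np
import pandas as pd

# Hypothetical blocks: two sales (one at zero), one sale, three similar sales.
toy = pd.DataFrame({
    "BLOCK": [1, 1, 2, 3, 3, 3],
    "PRICE M2": [20000.0, 0.0, 15000.0, 9000.0, 10000.0, 11000.0],
})


def coefficient_of_variation(column):
    """Sample standard deviation divided by the mean."""
    return column.std() / column.mean()


grouped = toy.groupby("BLOCK")["PRICE M2"]

# Option 1: apply the function to each group.
cv_apply = grouped.apply(coefficient_of_variation)

# Option 2: divide the per-group std by the per-group mean.
cv_ratio = grouped.std() / grouped.mean()

print(cv_apply)
print(cv_ratio)
```

Filtering out blocks with very few sales before ranking is one way to make the comparison more meaningful.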



### In which neighborhood are the apartments bigger?

In [22]:
size_per_block = nyc.groupby('BLOCK')[['LAND SQUARE METER']].mean()


size_per_block.sort_values(by='LAND SQUARE METER', ascending=False)

| BLOCK | LAND SQUARE METER |
|---|---|
| 934 | 13176.989908 |
| 1737 | 12979.942645 |
| 1301 | 7556.358408 |
| 1730 | 6698.538558 |
| 599 | 6419.225688 |
| ... | ... |
| 1583 | 78.503035 |
| 1967 | 76.412717 |
| 625 | 70.327571 |
| 888 | 58.064375 |
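
The cell above groups by `BLOCK`, while the question asks about neighborhoods. A minimal sketch of the same calculation grouped by `NEIGHBORHOOD`, using hypothetical toy values:

```python
import pandas as pd

# Hypothetical rows; the column names match the dataset.
toy = pd.DataFrame({
    "NEIGHBORHOOD": ["ALPHABET CITY", "ALPHABET CITY", "FINANCIAL", "INWOOD"],
    "BLOCK": [392, 402, 27, 2234],
    "LAND SQUARE METER": [150.0, 210.0, 900.0, 400.0],
})

# Average land area per neighborhood, largest first.
size_per_neighborhood = (
    toy.groupby("NEIGHBORHOOD")["LAND SQUARE METER"]
    .mean()
    .sort_values(ascending=False)
)

print(size_per_neighborhood)
```

On the real data the call would be `nyc.groupby('NEIGHBORHOOD')`, with the rest unchanged.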


### In general, can you see any difference in the average price per square meter of the apartments depending on their year of construction? What can you say about the relationship between the year of construction and the average total size in square feet?

In [23]:
nyc.groupby('YEAR BUILT')[['PRICE M2']].median()

| YEAR BUILT | PRICE M2 |
|---|---|
| 1800 | 75114.328458 |
| 1850 | 659.874792 |
| 1880 | 35313.624176 |
| 1890 | 9857.419363 |
| 1899 | 32511.166668 |
| ... | ... |
| 2010 | 169678.148770 |
| 2013 | 29294.199190 |
| 2014 | 38676.982714 |
| 2015 | 102099.935614 |


In [24]:
nyc.groupby('YEAR BUILT')[['LAND SQUARE FEET']].median()

| YEAR BUILT | LAND SQUARE FEET |
|---|---|
| 1800 | 1016.0 |
| 1850 | 4600.0 |
| 1880 | 3731.0 |
| 1890 | 1791.5 |
| 1899 | 1674.0 |
| ... | ... |
| 2010 | 8616.0 |
| 2013 | 4428.0 |
| 2014 | 7193.5 |
| 2015 | 18344.0 |


**Observation:**

The idea here is to notice that, despite the fluctuations, the price per m2 seems to trend upward over the years, while the size of the apartments tends to decrease.
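
Year-by-year medians fluctuate a lot, partly because many construction years have only a few sales, and the truncated outputs show only a handful of years. One possible way to make a trend easier to judge is to group by decade. The sketch below uses hypothetical toy data, and `DECADE BUILT` is an illustrative column name.

```python
import pandas as pd

# Hypothetical sales with their construction year.
toy = pd.DataFrame({
    "YEAR BUILT": [1900, 1905, 1913, 1920, 1928, 2010, 2015],
    "PRICE M2": [10000.0, 12000.0, 18000.0, 19000.0, 21000.0, 90000.0, 110000.0],
})

# Integer division truncates each year to the start of its decade.
toy["DECADE BUILT"] = (toy["YEAR BUILT"] // 10) * 10

median_by_decade = toy.groupby("DECADE BUILT")["PRICE M2"].median()

print(median_by_decade)
```

The same grouping applied to `LAND SQUARE FEET` would give a comparable view of how size relates to construction year.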

# Advanced

### Generate a `DataFrame` that aggregates the price per square meter (`PRICE M2`), residential units (`RESIDENTIAL UNITS`) and commercial units (`COMMERCIAL UNITS`) by `BLOCK` and neighborhood (`NEIGHBORHOOD`). Provide information about the central tendency and dispersion of these distributions.

(Hint: look for the **pivot_table** function)

In [26]:
nyc.pivot_table(
    # Define the columns to aggregate
    values=[
        'PRICE M2',
        'RESIDENTIAL UNITS',
        'COMMERCIAL UNITS'
    ],
    # Define the rows
    index=[
        'BLOCK',
        'NEIGHBORHOOD'
    ],
    # Define the aggregations; string names avoid the FutureWarning that
    # recent pandas versions emit when np.mean and np.std are passed
    aggfunc={
        'PRICE M2': ['mean', 'std', len],
        'RESIDENTIAL UNITS': ['mean', 'std'],
        'COMMERCIAL UNITS': ['mean', 'std']
    }
)

| BLOCK | NEIGHBORHOOD | COMMERCIAL UNITS (mean) | COMMERCIAL UNITS (std) | PRICE M2 (len) | PRICE M2 (mean) | PRICE M2 (std) | RESIDENTIAL UNITS (mean) | RESIDENTIAL UNITS (std) |
|---|---|---|---|---|---|---|---|---|
| 10 | FINANCIAL | 1.0 | NaN | 1 | 157674.187418 | NaN | 0.0 | NaN |
| 27 | FINANCIAL | 6.0 | 0.000000 | 2 | 153969.473449 | 11202.224238 | 396.5 | 112.429978 |
| 29 | FINANCIAL | 1.0 | NaN | 1 | 164119.372653 | NaN | 0.0 | NaN |
| 38 | FINANCIAL | 23.0 | NaN | 1 | 1282.309880 | NaN | 0.0 | NaN |
| 40 | FINANCIAL | 1.5 | 0.707107 | 2 | 181858.290247 | 37998.757947 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2234 | INWOOD | 0.0 | NaN | 1 | 14755.830567 | NaN | 36.0 | NaN |
| 2238 | INWOOD | 1.0 | NaN | 1 | 0.001435 | NaN | 28.0 | NaN |
| 2242 | INWOOD | 0.0 | NaN | 1 | 18513.933888 | NaN | 48.0 | NaN |
| 2248 | INWOOD | 0.0 | NaN | 1 | 14376.846538 | NaN | 61.0 | NaN |
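
A similar summary can also be produced with `groupby` and named aggregation, which yields flat column names instead of a two-level header. This is a sketch on hypothetical toy data, not a replacement for the `pivot_table` solution; the output column names (`price_mean`, `n_sales`, and so on) are illustrative.

```python
import pandas as pd

# Hypothetical sales in three blocks across two neighborhoods.
toy = pd.DataFrame({
    "BLOCK": [27, 27, 40, 40, 2234],
    "NEIGHBORHOOD": ["FINANCIAL"] * 4 + ["INWOOD"],
    "PRICE M2": [100.0, 200.0, 300.0, 500.0, 50.0],
    "RESIDENTIAL UNITS": [10, 20, 0, 0, 36],
})

# Each keyword defines one output column as (source column, aggregation).
summary = toy.groupby(["BLOCK", "NEIGHBORHOOD"]).agg(
    price_mean=("PRICE M2", "mean"),
    price_std=("PRICE M2", "std"),
    n_sales=("PRICE M2", "count"),
    res_units_mean=("RESIDENTIAL UNITS", "mean"),
)

print(summary)
```

As in the pivot table, groups with a single sale have a missing standard deviation, so the count column is useful for judging how much weight each row deserves.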
