# Importing and Inspecting Data

Data source: https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset/data



Let's begin by importing our data from a [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) file into a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) using the [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function.

In [1]:
# import our libraries
import pandas as pd # common data science library

In [3]:
df = pd.read_csv('realestatedata.csv') # read in our data

df # check out our data

,status,bed,bath,acre_lot,city,state,zip_code,house_size,prev_sold_date,price
0,for_sale,3.0,2.0,0.12,Adjuntas,Puerto Rico,601.0,920.0,,105000.0
1,for_sale,4.0,2.0,0.08,Adjuntas,Puerto Rico,601.0,1527.0,,80000.0
2,for_sale,2.0,1.0,0.15,Juana Diaz,Puerto Rico,795.0,748.0,,67000.0
3,for_sale,4.0,2.0,0.10,Ponce,Puerto Rico,731.0,1800.0,,145000.0
4,for_sale,6.0,2.0,0.05,Mayaguez,Puerto Rico,680.0,,,65000.0
...,...,...,...,...,...,...,...,...,...,...
1305061,for_sale,1.0,,7.15,Malone,New York,12953.0,360.0,2021-07-02,124900.0
1305062,for_sale,2.0,1.0,4.70,Malone,New York,12953.0,624.0,2011-12-14,79900.0
1305063,for_sale,,,21.00,Ellenburg Center,New York,12934.0,,,55000.0
1305064,for_sale,2.0,1.0,0.54,Owls Head,New York,12969.0,936.0,2007-10-12,495000.0
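
Before going further, it can help to check how many rows and columns were loaded, which types pandas inferred, and how many values are missing in each column (empty cells in the CSV are read as `NaN` by default). The sketch below is illustrative only: it builds a small, hypothetical `sample` DataFrame from an in-memory string holding three rows copied from the output above, so it runs without `realestatedata.csv`. The names `csv_text`, `sample`, and `missing_per_column` are assumptions made for this example.

```python
import io

import pandas as pd

# Three rows copied from the output above, with a subset of the columns.
# The third row has an empty house_size cell.
csv_text = """status,bed,bath,city,house_size,price
for_sale,3.0,2.0,Adjuntas,920.0,105000.0
for_sale,4.0,2.0,Adjuntas,1527.0,80000.0
for_sale,6.0,2.0,Mayaguez,,65000.0
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample.shape             # (number of rows, number of columns)
missing_per_column = sample.isna().sum()  # count of NaN values in each column

print(n_rows, n_cols)
print(sample.dtypes)         # the type pandas inferred for each column
print(missing_per_column)
```

The same three calls (`df.shape`, `df.dtypes`, `df.isna().sum()`) should work unchanged on the full `df`.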


Now let's look at the first 5 rows of our data using the [head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method.

In [4]:
df.head() # check out the first 5 rows of our data

,status,bed,bath,acre_lot,city,state,zip_code,house_size,prev_sold_date,price
0,for_sale,3.0,2.0,0.12,Adjuntas,Puerto Rico,601.0,920.0,,105000.0
1,for_sale,4.0,2.0,0.08,Adjuntas,Puerto Rico,601.0,1527.0,,80000.0
2,for_sale,2.0,1.0,0.15,Juana Diaz,Puerto Rico,795.0,748.0,,67000.0
3,for_sale,4.0,2.0,0.1,Ponce,Puerto Rico,731.0,1800.0,,145000.0
4,for_sale,6.0,2.0,0.05,Mayaguez,Puerto Rico,680.0,,,65000.0


In [5]:
df.head(10) # check out the first 10 rows of our data

,status,bed,bath,acre_lot,city,state,zip_code,house_size,prev_sold_date,price
0,for_sale,3.0,2.0,0.12,Adjuntas,Puerto Rico,601.0,920.0,,105000.0
1,for_sale,4.0,2.0,0.08,Adjuntas,Puerto Rico,601.0,1527.0,,80000.0
2,for_sale,2.0,1.0,0.15,Juana Diaz,Puerto Rico,795.0,748.0,,67000.0
3,for_sale,4.0,2.0,0.1,Ponce,Puerto Rico,731.0,1800.0,,145000.0
4,for_sale,6.0,2.0,0.05,Mayaguez,Puerto Rico,680.0,,,65000.0
5,for_sale,4.0,3.0,0.46,San Sebastian,Puerto Rico,612.0,2520.0,,179000.0
6,for_sale,3.0,1.0,0.2,Ciales,Puerto Rico,639.0,2040.0,,50000.0
7,for_sale,3.0,2.0,0.08,Ponce,Puerto Rico,731.0,1050.0,,71600.0
8,for_sale,2.0,1.0,0.09,Ponce,Puerto Rico,730.0,1092.0,,100000.0
9,for_sale,5.0,3.0,7.46,Las Marias,Puerto Rico,670.0,5403.0,,300000.0
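
As the two cells above show, `head` takes an optional row count and defaults to 5. A minimal sketch of that behaviour, together with the companion `tail` method for the end of the data, is below. The `frame` DataFrame is a hypothetical stand-in that holds six prices copied from the output above.

```python
import pandas as pd

# Hypothetical stand-in for df: the first six prices from the output above
frame = pd.DataFrame(
    {"price": [105000.0, 80000.0, 67000.0, 145000.0, 65000.0, 179000.0]}
)

first_two = frame.head(2)      # first 2 rows
last_two = frame.tail(2)       # last 2 rows
all_but_last = frame.head(-1)  # a negative n means "everything except the last n rows"

print(first_two)
print(last_two)
print(all_but_last)
```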


Now let's look at some summary statistics of our data using the [describe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) method.

In [6]:
# describe our data
df.describe()

,bed,bath,acre_lot,zip_code,house_size,price
count,1114504.0,1137530.0,953086.0,1304612.0,873721.0,1304958.0
mean,3.380298,2.507434,34.046572,7982.158,2191.381,862444.0
std,2.065159,1.90602,1305.264075,4028.216,3615.949,2773515.0
min,1.0,1.0,0.0,601.0,4.0,0.0
25%,2.0,2.0,0.11,5061.0,1152.0,260000.0
50%,3.0,2.0,0.3,8098.0,1700.0,475000.0
75%,4.0,3.0,1.26,11235.0,2524.0,824900.0
max,123.0,198.0,100000.0,99999.0,1450112.0,875000000.0
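
Two things in this output are worth noting. The `count` row differs from column to column because `describe` counts only non-missing values, and text columns such as `status`, `city`, and `state` are left out because `describe` summarizes only numeric columns by default. The sketch below reproduces both behaviours on a small, hypothetical `sample` DataFrame (three rows copied from the data above); passing `include="all"` is one way to bring the text columns into the summary.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample; the third row has a missing house_size
sample = pd.DataFrame(
    {
        "city": ["Adjuntas", "Adjuntas", "Mayaguez"],
        "house_size": [920.0, 1527.0, np.nan],
        "price": [105000.0, 80000.0, 65000.0],
    }
)

numeric_summary = sample.describe()           # numeric columns only
full_summary = sample.describe(include="all")  # also summarizes text columns

print(numeric_summary)
print(full_summary)
```

In `numeric_summary`, the `count` for `house_size` is 2 while the `count` for `price` is 3, which mirrors the differing counts in the full dataset.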


What if we wanted to look at each column by itself? We can select a column by using the following syntax: `dataframe['column_name']`. Let's look at the `price` column.

In [8]:
# select the price column
df['price']

# alternative attribute syntax (works when the column name is a valid Python identifier)
df.price

0          105000.0
1           80000.0
2           67000.0
3          145000.0
4           65000.0
             ...   
1305061    124900.0
1305062     79900.0
1305063     55000.0
1305064    495000.0
1305065    199000.0
Name: price, Length: 1305066, dtype: float64
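
The two selection syntaxes are not fully interchangeable. The sketch below, which uses a hypothetical `sample` DataFrame with two made-up column names, shows cases where only the bracket form selects the column, and how a list of names selects several columns at once.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "price": [105000.0, 80000.0],
        "house size": [920.0, 1527.0],  # hypothetical name containing a space
        "count": [1, 2],                # hypothetical name that clashes with a DataFrame method
    }
)

# For a simple name, both syntaxes return the same Series
same = sample["price"].equals(sample.price)

# sample.count is the DataFrame.count method, not the "count" column
clashes = callable(sample.count)

# Bracket syntax always works, including for names with spaces
sizes = sample["house size"]

# A list of names returns a DataFrame with just those columns
two_cols = sample[["price", "house size"]]

print(same, clashes)
print(two_cols)
```

For this reason the bracket form is usually the safer default.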

In [9]:
# describe the price column
df['price'].describe()

count    1.304958e+06
mean     8.624440e+05
std      2.773515e+06
min      0.000000e+00
25%      2.600000e+05
50%      4.750000e+05
75%      8.249000e+05
max      8.750000e+08
Name: price, dtype: float64

In [12]:
# we can also select individual metrics
df['price'].max()

875000000.0
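
Other statistics from the `describe` output can be requested the same way. A minimal sketch, assuming a hypothetical `prices` Series built from the first five prices shown above:

```python
import pandas as pd

# Hypothetical stand-in for df['price']: the first five prices from the data above
prices = pd.Series([105000.0, 80000.0, 67000.0, 145000.0, 65000.0], name="price")

lowest = prices.min()
middle = prices.median()     # same value as the 50% row of describe()
p90 = prices.quantile(0.9)   # any quantile between 0 and 1

# agg computes several named statistics in one call and returns a Series
summary = prices.agg(["min", "median", "max"])

print(lowest, middle, p90)
print(summary)
```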

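What if we wanted to sort our data? We can use the [sort_values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method, which sorts in ascending order by default. Let's sort the `price` column in ascending order, then use a descending sort as another way to find the maximum.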

In [14]:
# sort the price column
df['price'].sort_values()

# here's another way to find the max price
df['price'].sort_values(ascending=False).head(1)

572886    875000000.0
Name: price, dtype: float64
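
Sorting a single column returns only that column. To keep the other columns alongside the price, `sort_values` can be called on the whole DataFrame with a column name. The sketch below uses a hypothetical four-row `sample` copied from the data above, and also shows `nlargest` and `idxmax` as alternatives that avoid sorting everything when only the top rows are needed.

```python
import pandas as pd

# Hypothetical four-row sample copied from the data above
sample = pd.DataFrame(
    {
        "city": ["Adjuntas", "Adjuntas", "Juana Diaz", "Ponce"],
        "price": [105000.0, 80000.0, 67000.0, 145000.0],
    }
)

# Sort entire rows by price, most expensive first
by_price = sample.sort_values("price", ascending=False)

# The two most expensive rows, without sorting the whole frame
top_two = sample.nlargest(2, "price")

# Index label of the most expensive row, then the row itself
priciest_label = sample["price"].idxmax()
priciest_row = sample.loc[priciest_label]

print(by_price)
print(top_two)
print(priciest_row)
```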

Topics for further learning:
- the [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) method, which applies a function to each row or column of a DataFrame.
- the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method, which groups your data by one or more columns and applies a function to each group.
- the [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) method, which joins two DataFrames on shared columns or indexes.
- NumPy, a library for mathematical operations on arrays. pandas DataFrames are built on top of NumPy arrays.
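
As a small taste of the first three items, here is a minimal sketch on a hypothetical `listings` DataFrame (four rows copied from the data above). The `regions` lookup table and its `region` labels are invented purely for illustration, and `price_per_size` is a made-up column name.

```python
import pandas as pd

# Hypothetical four-row sample copied from the data above
listings = pd.DataFrame(
    {
        "city": ["Adjuntas", "Adjuntas", "Ponce", "Ponce"],
        "house_size": [920.0, 1527.0, 1800.0, 1050.0],
        "price": [105000.0, 80000.0, 145000.0, 71600.0],
    }
)

# groupby: median price for each city
median_price_by_city = listings.groupby("city")["price"].median()

# apply: run a function on every row (axis=1) to derive a new column
listings["price_per_size"] = listings.apply(
    lambda row: row["price"] / row["house_size"], axis=1
)

# merge: attach columns from a second table (invented labels, for illustration only)
regions = pd.DataFrame(
    {"city": ["Adjuntas", "Ponce"], "region": ["central", "south"]}
)
merged = listings.merge(regions, on="city", how="left")

print(median_price_by_city)
print(merged)
```

For simple arithmetic like this, the vectorized form `listings["price"] / listings["house_size"]` gives the same result as `apply` and is typically faster on large data.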