# Project 1

We are going to analyze real estate sales in Manhattan, New York, from November 2024 to December. 
Using the rolling sales data from https://www.nyc.gov/site/finance/property/property-rolling-sales-data.page, we are going to compute the mean, median, and mode of the sale price.  

First, we download the data file. The NYC Department of Finance provides the data only as an Excel file, so you should convert it to a CSV file before reading it in Python.

In [36]:
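import pandas as pd

# Skip the first 4 rows so that the next row is read as the header.
df = pd.read_csv("rollingsales_manhattan.csv", skiprows=4)

# Strip stray whitespace from the column names.
df.columns = df.columns.str.strip()

# Inspect the SALE PRICE column (dtype "object" means the prices were read as strings).
df["SALE PRICE"].info()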

<class 'pandas.core.series.Series'>
RangeIndex: 18491 entries, 0 to 18490
Series name: SALE PRICE
Non-Null Count  Dtype 
--------------  ----- 
18491 non-null  object
dtypes: object(1)
memory usage: 144.6+ KB


Next, we compute the mean, median, and mode of all listed sale prices. Before doing so, we remove the rows whose sale price is 0.
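
To see why the zero-priced rows matter, here is a minimal sketch on a small made-up list of price strings (the values are hypothetical, not taken from the dataset). It strips the thousands separators, drops the zeros, and compares the mean before and after, using only the standard library.

```python
# Hypothetical raw values, formatted like the SALE PRICE column.
raw_prices = ["1,200,000", "0", "850,000", "0", "3,100,000"]

# Strip the thousands separators and convert to integers.
prices = [int(p.replace(",", "")) for p in raw_prices]

# Keep only the sales with a positive price.
positive_prices = [p for p in prices if p > 0]

mean_all = sum(prices) / len(prices)
mean_positive = sum(positive_prices) / len(positive_prices)

print(mean_all)       # mean with the zero rows included
print(mean_positive)  # mean after removing them
```

With the zeros included, the mean of this sample is pulled well below every actual sale price in it, which is why the filter is applied before computing any statistic.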

In [None]:
# Clean the data: remove the "," thousands separators and convert the column to a numeric type.
df["SALE PRICE"] = pd.to_numeric(df["SALE PRICE"].str.replace(",", ""))

# Keep only the rows with a sale price above 0.
df_above_0 = df[df["SALE PRICE"] > 0]

average_sale_price = df_above_0["SALE PRICE"].mean()
median_sale_price = df_above_0["SALE PRICE"].median()
mode_sale_price = df_above_0["SALE PRICE"].mode()

print(average_sale_price)
print(median_sale_price)
print(mode_sale_price)

4255161.156563907
1260000.0
0    550000
Name: SALE PRICE, dtype: int64
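
The mode is printed as a Series rather than a single number because `Series.mode()` returns every value that shares the highest frequency; here only one price, 550000, qualifies. As a rough illustration of the tie case, the sketch below counts made-up prices (hypothetical values, not from the dataset) with a plain dictionary.

```python
# Hypothetical prices in which two values tie for the highest frequency.
prices = [550000, 700000, 550000, 900000, 700000]

# Count how often each price occurs.
counts = {}
for p in prices:
    counts[p] = counts.get(p, 0) + 1

# Every price whose count equals the maximum is a mode.
highest = max(counts.values())
modes = sorted(p for p, c in counts.items() if c == highest)

print(modes)  # [550000, 700000]
```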


Let's do it again the "hard" way. You may not use pandas, the statistics module, a spreadsheet program, etc. You should use the same dataset as in the first step, but without accessing the DataFrame/Series.
In other words, if you put the code for this step in a totally separate notebook, it should still work. You should calculate the mean, median, and mode yourself, not with functions of those names (or equivalents).

In [None]:
import csv

price_list = []
price_counts = {}

with open("rollingsales_manhattan.csv", mode="r", encoding="utf-8") as f:
    reader = csv.reader(f)

    # Skip the first 5 rows (everything before the first data row).
    for _ in range(5):
        next(reader)

    # Clean each figure in the SALE PRICE column (index 19), keep the
    # positive prices, and count how often each price occurs.
    for row in reader:
        price = int(row[19].replace(",", ""))
        if price > 0:
            price_list.append(price)
            if price in price_counts:
                price_counts[price] = price_counts[price] + 1
            else:
                price_counts[price] = 1

# Compute the mean.
total_sum = 0
total_count = 0
for p in price_list:
    total_sum += p
    total_count += 1
mean = total_sum / total_count

# Compute the median.
price_list.sort()
mid_index = total_count // 2
if total_count % 2 == 1:
    median = price_list[mid_index]
else:
    value_1 = price_list[mid_index - 1]
    value_2 = price_list[mid_index]
    median = (value_1 + value_2) / 2

# Compute the mode, keeping every price that ties for the highest count.
max_count = 0
mode_list = []
for price in price_counts:
    if price_counts[price] > max_count:
        max_count = price_counts[price]
        mode_list = [price]
    elif price_counts[price] == max_count:
        mode_list.append(price)

print(mean)
print(median)
print(*mode_list)

4255161.156563907
1260000
550000
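
The cell above depends on two fixed numbers: 5 rows to skip and column index 19. As a possible alternative, the sketch below finds the header row and the `SALE PRICE` column by name, still using only the `csv` module. The in-memory sample (its preamble line, column names other than `SALE PRICE`, and values) is hypothetical and stands in for the real file.

```python
import csv
import io

# Hypothetical miniature of the file: preamble lines, then the header row.
sample = io.StringIO(
    "Manhattan Rolling Sales\n"
    "\n"
    "ADDRESS, SALE PRICE ,SALE DATE\n"
    '1 EXAMPLE ST,"1,200,000",11/1/24\n'
    "2 EXAMPLE ST,0,11/2/24\n"
)

reader = csv.reader(sample)

# Advance until the row that contains the SALE PRICE header.
for row in reader:
    header = [name.strip() for name in row]
    if "SALE PRICE" in header:
        price_index = header.index("SALE PRICE")
        break
else:
    raise ValueError("no SALE PRICE column found")

# The remaining rows are data; read the price column by its looked-up index.
prices = [int(row[price_index].replace(",", "")) for row in reader]

print(price_index, prices)
```

Looking the column up by name keeps the code working if the number of preamble rows or the column order differs between downloads, at the cost of a few extra lines.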


Next, we are going to visualize the results computed above. The requirements are as follows:
The data/calculations can come through pandas, but the drawing code should only use the Python standard library.
In other words, don’t use plot(), plotly, or any other external packages.
The visualization should be visual, using shape, size, symbols, etc. to represent the values. Printing the numbers (as is) isn’t sufficient.

In [59]:
# Each "#" in a bar represents 50,000 of sale price.
scale = 50000

stats_data = {"average": average_sale_price, "median": median_sale_price}

print("The comparison between average and median")
for key in stats_data:
    # Bar length is the value divided by the scale, truncated to a whole number.
    bar_length = int(stats_data[key] / scale)
    bar = "#" * bar_length
    print(f"{key}:{bar} {stats_data[key]}")



The comparison between average and median
average:##################################################################################### 4255161.156563907
median:######################### 1260000.0
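
The fixed scale of 50000 works for these two values, but bars grow without limit for larger numbers. One possible variation, sketched below with only the standard library, scales every bar relative to the largest value and aligns the labels. The helper `text_bars` and its parameters are hypothetical names introduced for this illustration; the figures are rounded from the outputs above.

```python
def text_bars(values, max_width=60, symbol="#"):
    """Return one text bar per label, scaled so the largest value spans max_width."""
    largest = max(values.values())
    label_width = max(len(label) for label in values)
    lines = []
    for label, value in values.items():
        # Bar length is proportional to the value.
        length = round(value / largest * max_width)
        lines.append(f"{label.ljust(label_width)} | {symbol * length} {value:,.0f}")
    return lines


# Figures rounded from the results computed earlier in the notebook.
stats = {"average": 4255161, "median": 1260000, "mode": 550000}
for line in text_bars(stats):
    print(line)
```

Because the longest bar is always `max_width` characters, the chart fits on one line regardless of how large the sale prices are.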
