<a href="https://colab.research.google.com/github/claudiflower/Coffee_Machine_Model/blob/main/BTTAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Accenture 2A: From Coffee Machines to Machine Learning

Fall 2023

Student Team: Abby Rabbany, Abir Banik, Claudia Lihar, Noor El-Hawwat, Riya Bemby

# Business Understanding

## Goal
Our goal is to recommend the best specifications for a potential Accenture client who plans to open a series of coffee shops in New York City.

We will build three models, each producing a final recommendation for one of the following:

1. Location: Finding the best location for a coffee shop in New York City, taking into account factors such as foot traffic, competition, profit, crime rate, etc.

2. Three Specialty Items: Suggesting three menu items after scraping and analyzing Yelp datasets.

3. Characteristics: Suggesting other services and characteristics of popular coffee stores in New York City, such as Wi-Fi, music, etc.

## Project Scope
We will build this project over three months, from September to December 2023, and deliver an in-person presentation of it to Accenture.

# Data Preparation

In [8]:
# Packages you will need to download
%pip install pyarrow
%pip install pandas
%pip install numpy


## Yellow Taxi Data

In [66]:
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

# List of Parquet file names
file_names = ["taxi_data/taxi_jan2023.parquet", "taxi_data/taxi_feb2023.parquet", "taxi_data/taxi_mar2023.parquet", 
              "taxi_data/taxi_apr2023.parquet", "taxi_data/taxi_may2023.parquet", "taxi_data/taxi_jun2023.parquet", 
              "taxi_data/taxi_jan2022.parquet", "taxi_data/taxi_feb2022.parquet", "taxi_data/taxi_mar2022.parquet",
              "taxi_data/taxi_apr2022.parquet", "taxi_data/taxi_may2022.parquet", "taxi_data/taxi_jun2022.parquet",
              "taxi_data/taxi_jul2022.parquet", "taxi_data/taxi_aug2022.parquet", "taxi_data/taxi_sept2022.parquet",
              "taxi_data/taxi_oct2022.parquet", "taxi_data/taxi_nov2022.parquet", "taxi_data/taxi_dec2022.parquet",
              "taxi_data/taxi_jan2021.parquet", "taxi_data/taxi_feb2021.parquet", "taxi_data/taxi_mar2021.parquet",
              "taxi_data/taxi_apr2021.parquet", "taxi_data/taxi_may2021.parquet", "taxi_data/taxi_jun2021.parquet",
              "taxi_data/taxi_jul2021.parquet", "taxi_data/taxi_aug2021.parquet", "taxi_data/taxi_sept2021.parquet",
              "taxi_data/taxi_oct2021.parquet", "taxi_data/taxi_nov2021.parquet", "taxi_data/taxi_dec2021.parquet"]

# Initialize an empty list to store the DataFrames
taxi_dataframes = []

# Loop through the file names and read only the two columns we need from each Parquet file
for file_name in file_names:
    taxi_table = pq.read_table(file_name, columns=['tpep_pickup_datetime', 'DOLocationID'])
    taxi_dataframes.append(taxi_table.to_pandas())

# Concatenate all DataFrames in the list vertically
taxi_combined = pd.concat(taxi_dataframes, ignore_index=True)
taxi_combined.head()

Unnamed: 0,tpep_pickup_datetime,DOLocationID
0,2023-01-01 00:32:10,141
1,2023-01-01 00:55:08,237
2,2023-01-01 00:25:04,238
3,2023-01-01 00:03:48,7
4,2023-01-01 00:10:29,79
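
The hand-written list of 30 file names above could also be generated. The following is a minimal sketch, assuming the files keep the `taxi_<month><year>.parquet` naming shown in the list (including `sept` for September); `build_taxi_file_names` and `MONTH_ABBREVIATIONS` are hypothetical names introduced only for illustration.

```python
# Hypothetical helper: build the monthly Parquet paths instead of typing them out.
MONTH_ABBREVIATIONS = ["jan", "feb", "mar", "apr", "may", "jun",
                       "jul", "aug", "sept", "oct", "nov", "dec"]

def build_taxi_file_names(months_by_year, folder="taxi_data"):
    """Return paths like 'taxi_data/taxi_jan2023.parquet'.

    months_by_year maps a year to the number of months (counted from January)
    that should be included for that year.
    """
    paths = []
    for year, month_count in months_by_year.items():
        for month in MONTH_ABBREVIATIONS[:month_count]:
            paths.append(f"{folder}/taxi_{month}{year}.parquet")
    return paths

# Same coverage and order as the list above: Jan-Jun 2023, then all of 2022 and 2021.
generated_file_names = build_taxi_file_names({2023: 6, 2022: 12, 2021: 12})
print(len(generated_file_names))
```

Generating the paths keeps the list in sync if more months are added later, at the cost of hiding the exact file names from the reader.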


In [3]:
# Map each Manhattan TLC taxi zone ID to its zone name
neighborhood_dict = {
    4	: "Alphabet City",
    12	: "Battery Park",
    13	: "Battery Park City",
    24	: "Bloomingdale",
    41	: "Central Harlem",
    42	: "Central Harlem North",
    43	: "Central Park",
    45	: "Chinatown",
    48	: "Clinton East",
    50	: "Clinton West",
    68	: "East Chelsea",
    74	: "East Harlem North",
    75	: "East Harlem South",
    79	: "East Village",
    87	: "Financial District North",
    88	: "Financial District South",
    90	: "Flatiron",
    100	: "Garment District",
    103	: "Governor's Island/Ellis Island/Liberty Island",
    104	: "Governor's Island/Ellis Island/Liberty Island",
    105	: "Governor's Island/Ellis Island/Liberty Island",
    107	: "Gramercy",
    113	: "Greenwich Village North",
    114	: "Greenwich Village South",
    116	: "Hamilton Heights",
    120	: "Highbridge Park",
    125	: "Hudson Sq",
    127	: "Inwood",
    128	: "Inwood Hill Park",
    137	: "Kips Bay",
    140	: "Lenox Hill East",
    141	: "Lenox Hill West",
    142	: "Lincoln Square East",
    143	: "Lincoln Square West",
    144	: "Little Italy",
    148	: "Lower East Side",
    151	: "Manhattan Valley",
    152	: "Manhattanville",
    153	: "Marble Hill",
    158	: "Meatpacking/West Village West",
    161	: "Midtown Center",
    162	: "Midtown East",
    163	: "Midtown North",
    164	: "Midtown South",
    166	: "Morningside Heights",
    170	: "Murray Hill",
    186	: "Penn Station/Madison Sq West",
    194	: "Randalls Island",
    202	: "Roosevelt Island",
    209	: "Seaport",
    211	: "SoHo",
    224	: "Stuy Town/Peter Cooper Village",
    229	: "Sutton Place/Turtle Bay North",
    230	: "Times Sq/Theatre District",
    231	: "TriBeCa/Civic Center",
    232	: "Two Bridges/Seward Park",
    233	: "UN/Turtle Bay South",
    234	: "Union Sq",
    236	: "Upper East Side North",
    237	: "Upper East Side South",
    238	: "Upper West Side North",
    239	: "Upper West Side South",
    243	: "Washington Heights North",
    244	: "Washington Heights South",
    246	: "West Chelsea/Hudson Yards",
    249	: "West Village",
    261	: "World Trade Center",
    262	: "Yorkville East",
    263	: "Yorkville West"
}


The dictionary above maps each taxi zone ID shown on the TLC Manhattan taxi zone map to its zone name.

See: https://www.nyc.gov/assets/tlc/images/content/pages/about/taxi_zone_map_manhattan.jpg

Next, we match these taxi zones to the neighborhoods used in the demographics data.

In [4]:
""" NOTES ON DATASET PARSING
For West Village, zone 158 is Meatpacking/West Village West
For Chinatown-Two Bridges, zone 232 is Two Bridges/Seward Park
Hell's Kitchen has no taxi zone of its own; we estimated that 50 (Clinton West) was the closest
Not sure which taxi zones correspond to Upper West Side (Central), 18
For Hamilton Heights-Sugar Hill (25), Sugar Hill does not exist as a taxi zone
Harlem (South), 26, does not exist as a taxi zone
Some zones in neighborhood_dict are never used
"""

neighborhoods_demo = {
    # Financial District-Battery Park City
    1: [neighborhood_dict[87], neighborhood_dict[88], neighborhood_dict[13]],
    # Tribeca-Civic Center
    2: [neighborhood_dict[231]],
    # SoHo-Little Italy-Hudson Square
    3: [neighborhood_dict[211], neighborhood_dict[144], neighborhood_dict[125]],
    # Greenwich Village
    4: [neighborhood_dict[113], neighborhood_dict[114]],
    # West Village
    5: [neighborhood_dict[249], neighborhood_dict[158]],
    # Chinatown-Two Bridges
    6: [neighborhood_dict[45], neighborhood_dict[232]],
    # Lower East Side
    7: [neighborhood_dict[148]],
    # East Village
    8: [neighborhood_dict[79]],
    # Chelsea-Hudson Yards
    9: [neighborhood_dict[68], neighborhood_dict[246]],
    # Hell's Kitchen
    10: [neighborhood_dict[50]],
    # Midtown South-Flatiron-Union Square
    11: [neighborhood_dict[164], neighborhood_dict[90], neighborhood_dict[234]],
    # Midtown-Times Square
    12: [neighborhood_dict[230]],
    # Stuyvesant Town-Peter Cooper Village
    13: [neighborhood_dict[224]],
    # Gramercy
    14: [neighborhood_dict[107]],
    # Murray Hill-Kips Bay
    15: [neighborhood_dict[170], neighborhood_dict[137]],
    # East Midtown-Turtle Bay
    16: [neighborhood_dict[162], neighborhood_dict[229], neighborhood_dict[233]],
    # Upper West Side-Lincoln Square
    17: [neighborhood_dict[142], neighborhood_dict[143]],
    # Upper West Side (Central)
    # 18
    # Upper West Side-Manhattan Valley
    19: [neighborhood_dict[151]],
    # Upper East Side-Lenox Hill-Roosevelt Island
    20: [neighborhood_dict[140], neighborhood_dict[141], neighborhood_dict[202]],
    # Upper East Side-Carnegie Hill
    # 21: [neighborhood_dict[]],
    # Upper East Side-Yorkville
    22: [neighborhood_dict[262], neighborhood_dict[263]],
    # Morningside Heights
    23: [neighborhood_dict[166]],
    # Manhattanville-West Harlem
    24: [neighborhood_dict[152]],
    # Hamilton Heights-Sugar Hill
    25: [neighborhood_dict[116]],
    # Harlem (South)
    # 26: [neighborhood_dict[75]],
    # Harlem (North)
    27: [neighborhood_dict[42]],
    # East Harlem (South)
    28: [neighborhood_dict[75]],
    # East Harlem (North)
    29: [neighborhood_dict[74]],
    # Washington Heights (South)
    30: [neighborhood_dict[244]],
    # Washington Heights (North)
    31: [neighborhood_dict[243]],
    # Inwood
    32: [neighborhood_dict[127]],
    # United Nations
    33: [neighborhood_dict[233]],
    # The Battery-Governors Island-Ellis Island-Liberty Island
    34: [neighborhood_dict[12], neighborhood_dict[103], neighborhood_dict[104], neighborhood_dict[105]],
    # Randall's Island
    35: [neighborhood_dict[194]],
    # Highbridge Park
    36: [neighborhood_dict[120]],
    # Inwood Hill Park
    37: [neighborhood_dict[128]],
    # Central Park
    38: [neighborhood_dict[43]]

}

In [5]:
print(neighborhoods_demo[1])

['Financial District North', 'Financial District South', 'Battery Park City']
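
Because some taxi zones are shared between demographic neighborhoods and others are never used, it can help to check the mapping programmatically. Below is a minimal sketch on a small excerpt of the two dictionaries, repeated here so it runs on its own; `invert_mapping` and `unused_zone_names` are hypothetical helpers, not part of the notebook.

```python
# Small excerpts of the two mappings above, repeated so the sketch is self-contained.
zone_names = {
    13: "Battery Park City",
    87: "Financial District North",
    88: "Financial District South",
    231: "TriBeCa/Civic Center",
    261: "World Trade Center",
}
demo_to_zone_names = {
    1: [zone_names[87], zone_names[88], zone_names[13]],
    2: [zone_names[231]],
}

def invert_mapping(demo_mapping):
    """Map each taxi zone name to the demographic neighborhood IDs that use it."""
    inverted = {}
    for district_id, names in demo_mapping.items():
        for name in names:
            inverted.setdefault(name, []).append(district_id)
    return inverted

def unused_zone_names(all_zone_names, demo_mapping):
    """Return the zone names that no demographic neighborhood refers to."""
    used = {name for names in demo_mapping.values() for name in names}
    return sorted(set(all_zone_names.values()) - used)

zone_to_districts = invert_mapping(demo_to_zone_names)
print(zone_to_districts["Battery Park City"])
print(unused_zone_names(zone_names, demo_to_zone_names))
```

Run against the full `neighborhood_dict` and `neighborhoods_demo`, the same two helpers would list every zone that is shared or left unassigned.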


In [73]:
# Manhattan zone IDs taken from the TLC Taxi Zone Map; keep only trips dropped off in these zones
# Note: zones 103, 104, 105, 151, 194 and 209 are in neighborhood_dict but not in this list

manhattan_taxi_zones = [4, 12, 13, 24, 41, 42, 43, 45, 48, 50, 68, 74, 
                        75, 79, 87, 88, 90, 100, 107, 113, 114, 116, 
                        120, 125, 127, 128, 137, 140, 141, 142, 143, 144, 
                        148, 152, 153, 158, 161, 162, 163, 164, 166, 170, 
                        186, 202, 211, 224, 229, 230, 231, 232, 233, 234, 
                        236, 237, 238, 239, 243, 244, 246, 249, 261, 262, 263]

# This df only has trips with drop-offs in Manhattan, sorted by zone ID
taxi_manhattan_df = taxi_combined[taxi_combined['DOLocationID'].isin(manhattan_taxi_zones)].sort_values(by='DOLocationID')


In [74]:
taxi_manhattan_df.head()

Unnamed: 0,tpep_pickup_datetime,DOLocationID
44599510,2022-08-22 14:37:09,4
32303839,2022-05-02 14:39:34,4
85123224,2021-11-16 11:05:47,4
75841949,2021-08-20 15:08:37,4
28235919,2022-03-30 12:35:48,4


In [None]:
# TODO: count the trips for each location ID we need, and make a new df with these totals
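
One possible way to do this step is a `groupby` on `DOLocationID`. The sketch below runs on a toy stand-in for `taxi_manhattan_df`; the rows in `sample_trips` are illustrative, not real trips, and `sample_zone_names` and `dropoff_count` are hypothetical names chosen for this example.

```python
import pandas as pd

# Toy stand-in for taxi_manhattan_df (illustrative values only).
sample_trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2022-08-22 14:37:09", "2022-05-02 14:39:34", "2023-01-01 00:10:29",
        "2023-01-01 00:32:10", "2021-11-16 11:05:47", "2021-08-20 15:08:37",
    ]),
    "DOLocationID": [4, 4, 79, 141, 79, 4],
})

# Excerpt of neighborhood_dict covering the zones used in the toy data.
sample_zone_names = {4: "Alphabet City", 79: "East Village", 141: "Lenox Hill West"}

# Count the trips dropped off in each zone and keep the result as a DataFrame.
dropoff_counts = (
    sample_trips.groupby("DOLocationID")
    .size()
    .reset_index(name="dropoff_count")
)

# Attach the readable zone name to each row.
dropoff_counts["zone_name"] = dropoff_counts["DOLocationID"].map(sample_zone_names)
print(dropoff_counts)
```

Counting rows treats every trip equally; if foot traffic should be weighted by passengers or time of day, the aggregation would need additional columns that this notebook does not load.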