## For this week, you will be required to submit the following:

**A description of the problem and a discussion of the background. (15 marks)**

Tampa, FL has a reputation as a city where most citizens live in the surrounding suburbs and commute into the city for work. Because of this, the population isn't as dense as in other cities, which can make it difficult for restaurant businesses to pick locations for new venues. I'd like to explore whether the existing venues (in a certain suburb?) can be modeled based on their local area demographics to determine the best areas for a new venue to achieve a high number of "likes" or a high rating on Foursquare.

Being able to predict this information would help hedge against the usual risks that any small business faces when starting up. Sustaining a business with effective management is also important, but in order to even reach that point, a prospective founder needs to choose a location and actually open their venue. Mobile businesses such as food trucks are an alternative to committing to a specific location, but if that isn't a viable option, then it is important to be well-informed about the customer base near any potential location - otherwise, a struggling business may have to choose between relocating and closing shop altogether.

Most restaurants tend to be more popular with certain demographics than others. Market research may be required to identify what those target demographics actually are, and even if a businessperson thinks they know best, it wouldn't take much time or effort to employ some data science to convert publicly available information about local demographics and consumer sentiment about comparable venues into actionable insight on areas that are prime for a new venue.

**A description of the data and how it will be used to solve the problem. (15 marks)**

##### Data

+ Foursquare:
    + Venue locations (lat/long)
    + Venue category?
    + Venue "likes" from Foursquare are only available in Venue Details, at 1 result per call. In order to avoid burning through allotted calls by repeatedly requesting the same data, I'll probably keep a finalized JSON file after figuring out the data I want, and load it into Python from storage.

+ Census.gov
    + Demographic breakdown: CSV files of "SEX BY AGE" (B01001 and B01001A-I, ACS2017) per Census Tract for all of Florida (the Tampa Bay area includes a few counties so I'm not narrowing this data down too much yet) 
        + All races/ethnicities https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001/0400000US12.14000
        + White alone https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001A/0400000US12.14000
        + Black alone https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001B/0400000US12.14000
        + Am. Indian alone https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001C/0400000US12.14000
        + Asian alone https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001D/0400000US12.14000
        + Hawaiian alone https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001E/0400000US12.14000
        + Other alone https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001F/0400000US12.14000
        + 2+ races https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001G/0400000US12.14000
        + White alone, non-Hispanic https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001H/0400000US12.14000
        + Hispanic https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001I/0400000US12.14000
    
    + Shapefiles for census tracts based on the same year as the demographic data (2017) https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2017&layergroup=Census+Tracts
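
The caching idea for Venue Details could look something like the sketch below. It is only an illustration under assumptions: `load_or_fetch_details`, `fetch_details`, and the cache file path are hypothetical names, and `fetch_details` stands in for whatever function ends up wrapping the actual Foursquare request (the request itself is deliberately not shown).

```python
import json
from pathlib import Path


def load_or_fetch_details(venue_ids, fetch_details, cache_path):
    """Return {venue_id: details}, calling fetch_details only for uncached IDs.

    fetch_details is a hypothetical callable (venue_id -> JSON-serializable
    dict) that would wrap the real Venue Details request.
    """
    cache_file = Path(cache_path)
    # Start from whatever was stored by a previous run, if anything.
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    for venue_id in venue_ids:
        if venue_id not in cache:
            cache[venue_id] = fetch_details(venue_id)  # one API call per venue
            # Save after every new result so an interrupted run loses nothing.
            cache_file.write_text(json.dumps(cache, indent=2))
    return cache
```

Re-running the notebook would then read from the JSON file and only spend calls on venue IDs that have not been seen before.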


##### Plan

1. Get lat/long for center of Brandon, FL (suburb of Tampa)
2. Get venues within 15 mile radius of center of Brandon, FL
    + Filter by category to restaurants (Mexican specifically?)
    + Take list of venueIDs and iterate through it to call for venue details of each, specifically seeking number of "likes" and the rating
    + Store this data to avoid repeating this call (it'll eat through the allotment quickly)
3. Import shapefiles and get lat/long for center of each census tract
    + Find distance of each to center of Brandon, FL; filter to census tracts within 15 mile radius
4. Import census tract demographic data from CSVs
    + Import per-race CSVs into individual dataframes; from 4th column on, only need every other column (estimates only, no need for margin of error; column headers are confusing because they include an extra header row, causing them to be misaligned)
    + Append them into a single master dataframe with an additional column to identify which racial dataframe the rows came from
    + Pivot to reduce data to 1 row per census tract (4247 rows for FL); with age/sex/race and age/sex subpopulation columns (23x2x9 + 23x2x1 = 460 columns)
    + Add additional subpopulation columns aggregating by age/race, age, and race (23x10 + 23 + 2 = 255 more columns)
    + Filter to relevant census tracts
5. Calculate the distance from each venue to each census tract, then calculate the proximity of each demographic subpopulation to the venue ("how far away from the venue is this subpopulation?"): weight each tract's distance from the venue by that tract's share of the subpopulation across all the tracts, sum over the tracts, and divide by the number of census tracts
6. Dataframe to be analyzed: row per venue; columns for category, lat, long, likes, rating, and proximity of each subpopulation
7. Use machine learning algorithms such as regression and decision tree to develop model for predicting the number of likes and the rating of a venue based on category and subpopulation proximities
8. Identify "untapped markets" of census tracts where a restaurant could do well by mapping census tracts with their predicted number of likes and rating, along with the number of existing venues in that census tract
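
The distance and proximity calculation in the plan can be sketched roughly as follows. This is a rough illustration rather than the final implementation: it assumes great-circle (haversine) distance in miles between lat/long points, NumPy arrays as inputs, and the function names `haversine_miles` and `subpopulation_proximity` are made up for this example. The proximity follows the description in the plan literally (distance times the tract's share of the subpopulation, summed over tracts, divided by the number of tracts).

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; inputs are degrees and may be arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


def subpopulation_proximity(venue_latlon, tract_latlon, tract_pop):
    """Proximity of each subpopulation to each venue.

    venue_latlon: (V, 2) lat/long per venue
    tract_latlon: (T, 2) lat/long of each census tract center
    tract_pop:    (T, S) population counts per tract and subpopulation
    Returns a (V, S) array.
    """
    venue_latlon = np.asarray(venue_latlon, dtype=float)
    tract_latlon = np.asarray(tract_latlon, dtype=float)
    tract_pop = np.asarray(tract_pop, dtype=float)

    # Pairwise venue-to-tract distances via broadcasting: shape (V, T).
    dist = haversine_miles(venue_latlon[:, None, 0], venue_latlon[:, None, 1],
                           tract_latlon[None, :, 0], tract_latlon[None, :, 1])

    # Each tract's share of a subpopulation across all tracts: shape (T, S).
    # Subpopulations with zero total get a share of 0 instead of NaN.
    totals = tract_pop.sum(axis=0)
    share = np.divide(tract_pop, totals,
                      out=np.zeros_like(tract_pop), where=totals > 0)

    # Share-weighted distance, divided by the number of tracts as in the plan.
    return dist @ share / tract_pop.shape[0]
```

The same `haversine_miles` helper could also serve the 15 mile radius filters on venues and census tracts. Note that because the shares already sum to 1 for each subpopulation, the final division by the number of tracts only rescales every value by the same constant.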

## For the second week, the final deliverables of the project will be:

**A link to your Notebook on your Github repository, showing your code. (15 marks)**

**A full report consisting of all of the following components (15 marks):**
+ Introduction where you discuss the business problem and who would be interested in this project.
+ Data where you describe the data that will be used to solve the problem and the source of the data.
+ Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, and what machine learning methods were used and why.
+ Results section where you discuss the results.
+ Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
+ Conclusion section where you conclude the report.

**Your choice of a presentation or blogpost. (10 marks)**