## Technical report requirements
1. Clearly defined problem statement/goal.
2. Description of data, including the source.
3. Description of relevant data cleaning and munging.
        a. Code included where appropriate.
4. EDA relevant to your problem statement.
        a. Record of any changes to data informed by EDA.
        b. Record of quality assessment and any assumptions that you are making based on the EDA.
5. Description of model selection.
        a. This should address why you have selected the model or models to answer your question.
        b. Why the choice is appropriate.
6. Well-documented code of modeling you performed to answer your question.
7. Description of results.
        a. Includes relevant code, metrics, and visualizations that describe your results.
        b. Explanation of all metrics and charts.
8. Conclusion.
        a. Summarize your findings, addressing the strengths and weaknesses of your approach.
        b. Recommendations and future directions.


NOTE: Your code should be well-organized, well-documented, and limited to only the code that is relevant in the report. If you have a lot of parsing/munging code that distracts from the flow of the report you can keep this in another submitted file and mention it or link to it. Make sure to explain why you perform coding steps along the way.

## Goal:

Selecting a bottle of wine, whether for yourself or a friend, can be a daunting task. What if you could consult a taste profile that came with a preset list of wines you are likely to enjoy? That is the goal of my project: to create taste profiles that make selecting wine more enjoyable.

## The Data:

I began data collection by reviewing many websites with curated lists of wines. I settled on vivino.com because it had one of the largest wine databases. I set up four scraping notebooks: wine details and wine reviews, each split into red wine and white wine.

Data Gathering process:

Step 1: I used Selenium's webdriver to collect the link to each wine page. I scraped Vivino for the URL of each wine and saved every 100 links to a txt file. I ran this code separately for red wine links and white wine links.

Step 2: Once I had all of the links, I looped over each URL and used XPaths to grab the underlying wine details.

Step 3: Using the same wine URLs, I set up a second loop over each page and used both webdriver and XPaths to gather the wine, user, and review data for each wine.

Step 4: I used Python's os module to find all of the raw data files I had gathered and concatenated them into 4 datasets.
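Step 4 can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the notebooks: the function name `concat_raw_files`, the file-name prefix, and the assumption that every raw file is a CSV with the same header row are all hypothetical.

```python
import csv
import os


def concat_raw_files(raw_dir, prefix, out_path):
    """Merge every CSV in raw_dir whose name starts with prefix into out_path.

    Assumes (hypothetically) that all matching files share one header row.
    Returns the number of data rows written.
    """
    # List the inputs before creating the output file, in a repeatable order.
    names = sorted(
        name for name in os.listdir(raw_dir)
        if name.startswith(prefix) and name.endswith(".csv")
    )
    rows_written = 0
    header_written = False
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for name in names:
            in_path = os.path.join(raw_dir, name)
            with open(in_path, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header)  # keep the header only once
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

A call such as `concat_raw_files("Raw_Data", "red_details", "red_details_all.csv")` (hypothetical paths) would be repeated once per dataset, giving the four combined files. Writing the output outside the raw data folder avoids picking it up as an input on a later run.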

Key Takeaways/Lessons Learned:
Saving every 100 instances to a file was important because the scraping took hours to complete. At one point I lost my connection, and saving as I went let me pick up where the data gathering had stopped. Running wine details and wine reviews in separate notebooks, each split by red and white wine, was essential for finishing the scraping because it let me run the notebooks in parallel.
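The save-as-you-go idea can be sketched like this. It is an illustration under assumptions rather than the notebook code: `scrape_with_checkpoints` and its `fetch` argument are hypothetical names, and `fetch` stands in for whatever Selenium call returns one line of text per URL.

```python
import os


def scrape_with_checkpoints(urls, fetch, out_dir, prefix, batch_size=100):
    """Call fetch(url) for each URL and write results to disk every batch_size items.

    Batches that already exist on disk are skipped, so an interrupted run
    can be restarted without repeating finished work.
    """
    os.makedirs(out_dir, exist_ok=True)
    for start in range(0, len(urls), batch_size):
        batch_number = start // batch_size
        path = os.path.join(out_dir, f"{prefix}_{batch_number:04d}.txt")
        if os.path.exists(path):
            continue  # this batch was finished in an earlier run
        results = [fetch(url) for url in urls[start:start + batch_size]]
        # Write under a temporary name first so a crash mid-write
        # never leaves a half-finished batch that looks complete.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            f.writelines(line + "\n" for line in results)
        os.replace(tmp_path, path)
```

If the connection drops partway through a batch, only that batch is lost. Running the function again resumes at the first batch that has no file.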

Data I have to work with:

4 csv files:

   1. Red Wine Details: wine name, winery, region, country, avg rating and avg price
   2. White Wine Details: wine name, winery, region, country, avg rating and avg price
   3. Red Wine Reviews: wine, user and reviews
   4. White Wine Reviews: wine, user and reviews

Here is the link to the raw data gathering code:
[Raw Data](https://github.com/divyasusarla/DSI-SF-2-divyasusarla/tree/master/Capstone/Raw_Data)

## Data Cleaning:

Steps for cleaning the wine details data (both red and white):
1. Create an index for the wine (to be able to plot)
2. Change column headers to match the column information
3. Convert average rating and average price data from objects to floats
4. Pull out the year from the wine names
5. Clean up Unicode artifacts in the wine, winery, region, and country fields

In both details datasets, I converted missing values to null (NaN).
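Steps 1, 3, and 4, plus the missing-value handling, might look roughly like the sketch below. It uses plain Python rather than the code in the cleaning notebooks. The column names (`wine`, `avg_rating`, `avg_price`), the helper names, and the sample row are hypothetical, and the price parser assumes a period as the decimal separator.

```python
import math
import re

# A four-digit year starting with 19 or 20, standing on its own.
YEAR_PATTERN = re.compile(r"\b(19|20)\d{2}\b")


def to_float(text):
    """Convert scraped text such as '$24.99' or '4.1' to a float, or NaN if missing."""
    if text is None:
        return math.nan
    # Drop currency symbols, thousands separators, and whitespace.
    cleaned = re.sub(r"[^0-9.]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return math.nan  # empty or unparseable values become NaN


def extract_year(wine_name):
    """Return the vintage found in a wine name as an int, or None if there is none."""
    match = YEAR_PATTERN.search(wine_name)
    return int(match.group()) if match else None


def clean_row(index, row):
    """Build a cleaned record from one raw row (a dict of strings)."""
    return {
        "wine_id": index,
        "wine": row["wine"].strip(),
        "year": extract_year(row["wine"]),
        "avg_rating": to_float(row.get("avg_rating")),
        "avg_price": to_float(row.get("avg_price")),
    }


# Made-up row for illustration only.
sample = {"wine": "Example Cellars Cabernet Sauvignon 2012",
          "avg_rating": "4.1", "avg_price": "$24.99"}
print(clean_row(0, sample))
```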

Initial cleaning of the reviews data:
1. Create an index for each review
2. Remove the `\n` characters from the review text
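A minimal sketch of these two review-cleaning steps, with a hypothetical function name. It makes one deliberate choice that may differ from the notebook: each newline is replaced with a single space rather than deleted, so that words on either side do not fuse together.

```python
def clean_reviews(raw_reviews):
    """Return (review_id, text) pairs with newlines and repeated spaces collapsed."""
    # str.split() with no arguments splits on any whitespace, including "\n".
    return [(i, " ".join(text.split())) for i, text in enumerate(raw_reviews)]
```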

Here is the link to the data cleaning code:
[Clean Data](https://github.com/divyasusarla/DSI-SF-2-divyasusarla/tree/master/Capstone/Clean_Data)

## EDA relevant to creating wine taste profiles

To get a better sense of the wine details data, I plotted the wines to see the spread across prices, countries, regions, years, and ratings. The plots did not prompt any further changes to the details data.

Here is the link to the plotting:
[Plotting Code](https://github.com/divyasusarla/DSI-SF-2-divyasusarla/tree/master/Capstone/Plotting%20EDA)


For the reviews data, I built an LDA topic model to find the terms (features) with the highest probability of appearing in a review. The hardest part of the topic modeling was refining the documents passed to the model so that it produced useful information. I am still refining this step and constructing the white wine topic model.
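To give a sense of what refining the documents can involve, here is a small sketch of one common approach: lowercasing, tokenizing, dropping stopwords and very short tokens, and discarding reviews that are too short to say anything about a topic. This is not the preprocessing in the LDA notebook. The stopword list, the thresholds, and the function names are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list; "wine" is included because it appears in
# nearly every review and so says little about any one topic.
STOPWORDS = {"a", "an", "and", "the", "this", "that", "is", "it",
             "of", "with", "for", "wine", "very"}


def tokenize(review):
    """Lowercase a review and keep informative word tokens."""
    words = re.findall(r"[a-z']+", review.lower())
    return [w for w in words if w not in STOPWORDS and len(w) > 2]


def build_documents(reviews, min_tokens=3):
    """Tokenize reviews and drop those with fewer than min_tokens tokens."""
    docs = [tokenize(review) for review in reviews]
    return [doc for doc in docs if len(doc) >= min_tokens]


def term_counts(docs):
    """Count how often each term occurs across all documents."""
    return Counter(token for doc in docs for token in doc)
```

The resulting token lists are the kind of input a topic-modeling library such as gensim or scikit-learn expects after vectorizing. Looking at `term_counts(docs).most_common()` is one way to spot further words worth adding to the stopword list.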

Here is the link to the topic modeling:
[Red Wine LDA analysis](https://github.com/divyasusarla/DSI-SF-2-divyasusarla/tree/master/Capstone/LDA%20analysis)

The white wine LDA analysis will be added.

## Model, Results, Conclusion

I will be working with clustering algorithms to create my taste profiles. Stay tuned!