
# COGS 108 - Data Checkpoint

# Names

- Dylan Atianzar
- Nicholas Skrable
- Rashi Haria
- Jasneet Singh

# Research Question

How does a consumer's age, gender, frequency of browsing on Amazon, interaction with Amazon's personalized recommendations, and perception toward customer reviews affect their shopping satisfaction on Amazon in 2023?

## Background and Prior Work

**Background:**

This project examines whether product reviews influence consumer purchasing decisions. Reviews matter to many kinds of businesses, including restaurants and other service providers, where feedback from previous customers can strongly influence where potential customers choose to dine or obtain services. We are interested in the relationship between product reviews and consumers.

Many factors can shape an individual’s decision to buy something, such as personal research into the product or a recommendation from someone else, but we believe previous reviews are the most significant factor. A 2022 eye-tracking study showed participants reviews for a phone and recorded whether they ended up buying it. The researchers found that consumers pay more attention to negative reviews than to positive ones, and that this attention correlates with whether they purchase the product. They also found that the importance of reviews differs between men and women: the men focused primarily on the phone’s hardware specifications, while the women factored the reviews into their decision [3]. We had not initially considered subgroups within consumers, so there is more to explore in that area.

Another 2022 study analyzed survey results to gain insight into the relationship between consumer purchases and product reviews. It found that consumers judge reviews based on their credibility, which can be derived from the volume of reviews, the language used by the reviewer, and the overall average score. Together, these elements determine whether reviews end up contributing to a consumer’s purchase decision [2].

Generally, these studies indicate that consumers do take note of product reviews and treat them as significant. The research points to a correlation between a customer’s purchase decision and the overall sentiment toward the product. We also recognize that the credibility of these reviews factors into a consumer’s purchase decision. However, we aim to find the extent to which reviews affect that decision and whether their impact varies across product types.


**References:**
1. Bizrate Insights. (2023, December 11). The Impact of Customer Reviews on Purchase Decisions. https://bizrateinsights.com/resources/the-impact-of-customer-reviews-on-purchase-decisions/ 
2. Fernandes, S., Panda, R., et al. (2022, July 9). Measuring the impact of online reviews on consumer purchase decisions – A scale development study. Journal of Retailing and Consumer Services. https://www.sciencedirect.com/science/article/pii/S096969892200159X 
3. Chen, T., Samaranayake, P., et al. (2022, May 2). The impact of online reviews on consumers’ purchasing decisions: Evidence from an eye-tracking study. Frontiers. https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2022.865702/full

# Hypothesis


Our hypothesis is that consumers' consideration of product reviews significantly influences their purchase decisions, with the impact varying across different product categories. Specifically, we predict that:

1. **High-performance and quality-critical products:** For categories such as electronics and appliances, where product performance and quality are crucial, consumers are more likely to rely heavily on product reviews. This reliance stems from the need for assurance about the product's reliability and functionality.
2. **High-priced items:** Higher-priced products depend more on favorable reviews for sales. Consumers tend to research thoroughly to avoid 'buyer's remorse', so reviews have a stronger influence on their purchasing decisions.
3. **Preference-based products:** In contrast, for categories like cosmetics and fashion, where personal preference plays a larger role, the influence of product reviews is relatively lower. Consumers in these categories may prioritize individual tastes and trends over reviews.

#### Rationale

This hypothesis is grounded in the assumption that consumers seek to minimize risk and maximize satisfaction in their purchasing decisions. Reviews, both from peers and experts, provide critical information that helps consumers make informed choices. For high-performance, quality-critical, and high-priced items, the perceived risk is higher, prompting a stronger reliance on reviews. For preference-based products, on the other hand, the decision-making process is more subjective, and reviews play a less central role.


# Data

## Data overview


## Dataset: Amazon Consumer Behavior

- Dataset Name: Amazon Customer Behavior Dataset
- Link to the dataset: [https://www.kaggle.com/datasets/swathiunnikrishnan/amazon-consumer-behaviour-dataset/data](https://www.kaggle.com/datasets/swathiunnikrishnan/amazon-consumer-behaviour-dataset/data)
- Number of observations: 602 observations
- Number of variables: 23 variables

## Setup:

In [7]:
```python
# import pandas and numpy library
import pandas as pd
import numpy as np

# import plot libraries
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns

# improve resolution
%config InlineBackend.figure_format = 'retina'
```

In [8]:
```python
consumer_behavior = pd.read_csv('Amazon-Customer-Behavior-Survey.csv')
consumer_behavior.head()
```

| | Timestamp | age | Gender | Purchase_Frequency | Purchase_Categories | Personalized_Recommendation_Frequency | Browsing_Frequency | Product_Search_Method | Search_Result_Exploration | Customer_Reviews_Importance | ... | Saveforlater_Frequency | Review_Left | Review_Reliability | Review_Helpfulness | Personalized_Recommendation_Frequency.1 | Recommendation_Helpfulness | Rating_Accuracy | Shopping_Satisfaction | Service_Appreciation | Improvement_Areas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023/06/04 1:28:19 PM GMT+5:30 | 23 | Female | Few times a month | Beauty and Personal Care | Yes | Few times a week | Keyword | Multiple pages | 1 | ... | Sometimes | Yes | Occasionally | Yes | 2 | Yes | 1 | 1 | Competitive prices | Reducing packaging waste |
| 1 | 2023/06/04 2:30:44 PM GMT+5:30 | 23 | Female | Once a month | Clothing and Fashion | Yes | Few times a month | Keyword | Multiple pages | 1 | ... | Rarely | No | Heavily | Yes | 2 | Sometimes | 3 | 2 | Wide product selection | Reducing packaging waste |
| 2 | 2023/06/04 5:04:56 PM GMT+5:30 | 24 | Prefer not to say | Few times a month | Groceries and Gourmet Food;Clothing and Fashion | No | Few times a month | Keyword | Multiple pages | 2 | ... | Rarely | No | Occasionally | No | 4 | No | 3 | 3 | Competitive prices | Product quality and accuracy |
| 3 | 2023/06/04 5:13:00 PM GMT+5:30 | 24 | Female | Once a month | Beauty and Personal Care;Clothing and Fashion;... | Sometimes | Few times a month | Keyword | First page | 5 | ... | Sometimes | Yes | Heavily | Yes | 3 | Sometimes | 3 | 4 | Competitive prices | Product quality and accuracy |
| 4 | 2023/06/04 5:28:06 PM GMT+5:30 | 22 | Female | Less than once a month | Beauty and Personal Care;Clothing and Fashion | Yes | Few times a month | Filter | Multiple pages | 1 | ... | Rarely | No | Heavily | Yes | 4 | Yes | 2 | 2 | Competitive prices | Product quality and accuracy |
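
Before selecting columns, it can help to confirm the table's dimensions and spot columns that pandas renamed on load. The following is a minimal sketch run on a small stand-in frame; `summarize_structure` and `sample` are hypothetical names that are not part of the original notebook, and in the notebook the helper would be called on `consumer_behavior` instead.

```python
import pandas as pd


def summarize_structure(df: pd.DataFrame) -> dict:
    """Return basic structural facts about a DataFrame."""
    return {
        "n_observations": df.shape[0],
        "n_variables": df.shape[1],
        # read_csv appends ".1" to a column name that repeats in the file
        "renamed_duplicates": [c for c in df.columns if str(c).endswith(".1")],
    }


# Hypothetical stand-in rows that reuse column names from the survey
sample = pd.DataFrame({
    "age": [23, 23, 24],
    "Gender": ["Female", "Female", "Prefer not to say"],
    "Personalized_Recommendation_Frequency": ["Yes", "Yes", "No"],
    "Personalized_Recommendation_Frequency.1": [2, 2, 4],
})

structure = summarize_structure(sample)
print(structure)
```

A column ending in `.1`, such as `Personalized_Recommendation_Frequency.1` in the preview, suggests the survey file contains two columns with the same name, so it is worth checking which one each analysis step actually uses.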


## Data Cleaning and Wrangling

In [11]:
```python
# Choose which columns we want; .copy() avoids SettingWithCopyWarning on later edits
columns = ['age', 'Gender', 'Purchase_Frequency', 'Personalized_Recommendation_Frequency',
           'Browsing_Frequency', 'Customer_Reviews_Importance', 'Review_Reliability',
           'Review_Helpfulness', 'Recommendation_Helpfulness', 'Shopping_Satisfaction']
chosen_data = consumer_behavior[columns].copy()
chosen_data
```

| | age | Gender | Purchase_Frequency | Personalized_Recommendation_Frequency | Browsing_Frequency | Customer_Reviews_Importance | Review_Reliability | Review_Helpfulness | Recommendation_Helpfulness | Shopping_Satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 23 | Female | Few times a month | Yes | Few times a week | 1 | Occasionally | Yes | Yes | 1 |
| 1 | 23 | Female | Once a month | Yes | Few times a month | 1 | Heavily | Yes | Sometimes | 2 |
| 2 | 24 | Prefer not to say | Few times a month | No | Few times a month | 2 | Occasionally | No | No | 3 |
| 3 | 24 | Female | Once a month | Sometimes | Few times a month | 5 | Heavily | Yes | Sometimes | 4 |
| 4 | 22 | Female | Less than once a month | Yes | Few times a month | 1 | Heavily | Yes | Yes | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 597 | 23 | Female | Once a week | Sometimes | Few times a week | 4 | Moderately | Sometimes | Sometimes | 4 |
| 598 | 23 | Female | Once a week | Sometimes | Few times a week | 3 | Heavily | Sometimes | Sometimes | 3 |
| 599 | 23 | Female | Once a month | Sometimes | Few times a week | 3 | Occasionally | Sometimes | Sometimes | 3 |
| 600 | 23 | Female | Few times a month | Yes | Few times a month | 1 | Heavily | Yes | Yes | 2 |
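
Several of the selected columns hold ordered text levels (for example `Purchase_Frequency`), which most statistical routines cannot use directly. One possible approach, sketched below, is to map each level to an integer rank. The ordering in `PURCHASE_ORDER` is our assumption and lists only the levels visible in the preview above; `encode_ordinal` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Assumed ordering, least to most frequent. Only levels visible in the
# preview are listed, so any other level in the full data maps to <NA>.
PURCHASE_ORDER = [
    "Less than once a month",
    "Once a month",
    "Few times a month",
    "Once a week",
]


def encode_ordinal(series: pd.Series, order: list) -> pd.Series:
    """Map text levels to integer ranks 0..len(order)-1; unknown levels become <NA>."""
    mapping = {level: rank for rank, level in enumerate(order)}
    return series.map(mapping).astype("Int64")


# Purchase_Frequency values from the first five rows of the preview
purchase = pd.Series([
    "Few times a month",
    "Once a month",
    "Few times a month",
    "Once a month",
    "Less than once a month",
])

encoded = encode_ordinal(purchase, PURCHASE_ORDER)
print(encoded.tolist())
```

Levels missing from the assumed order come back as `<NA>`, which makes an incomplete ordering easy to spot before analysis.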


In [11]:
```python
chosen_data['Customer_Reviews_Importance'].value_counts()
```

```
Customer_Reviews_Importance
3    216
1    169
2    115
4     64
5     38
Name: count, dtype: int64
```
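
The raw counts are easier to compare as shares of all responses. The sketch below rebuilds the counts reported above as a Series (so it runs without the CSV) and converts them to proportions; in the notebook, `value_counts(normalize=True)` on the same column should give the equivalent result directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({3: 216, 1: 169, 2: 115, 4: 64, 5: 38}, name="count")
counts.index.name = "Customer_Reviews_Importance"

total = int(counts.sum())

# Share of respondents at each rating, ordered by rating value
shares = (counts / total).sort_index().round(3)

print(total)
print(shares)
```

Because `value_counts()` skips missing values by default, a total equal to the 602 observations listed in the overview suggests this column has no missing responses.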

# Ethics & Privacy

#### Ethical considerations:
In our project, we have reviewed ethical concerns across the entire research process. Our commitment is to ensure that every step, from formulating our research question to communicating our results, adheres to high ethical standards.

#### Data collection and bias considerations
During the data collection process, we evaluated ethical concerns, particularly regarding the dataset's source and its implications. We selected the Amazon Consumer Behavior dataset, which was collected through online surveys predominantly conducted in the Indian subcontinent. While this dataset provides valuable real-world insights, it has limitations in terms of generalizability. The survey population primarily includes people who use Amazon and are willing to share information about their shopping habits. Consequently, the findings may not be representative of the entire consumer population, because some groups are statistically underrepresented, and may not apply to regions outside the Indian subcontinent or to Indians living in Europe and Asia, as specified by the survey description.

#### Mitigating potential biases
To address these biases, we took several measures:
1. **Transparency with participants:** We chose a dataset whose survey participants were explicitly informed about the survey's purpose and the intended use of their data (the survey was conducted through a dedicated Google Doc).
2. **Exploratory data analysis:** We conducted thorough exploratory data analysis to identify any anomalies, missing values, and patterns that could indicate biases.
3. **Data cleaning:** We cleaned the dataset to minimize biases, so that the data used in our analysis is as accurate and unbiased as possible.
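
The checks for missing values and anomalies described above could look roughly like the following. This is a sketch under stated assumptions, not the notebook's actual code: `quality_report` is a hypothetical helper, the 1 to 5 range is inferred from the rating values seen in the data, and `sample` is made-up data that includes one deliberate error.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, likert_cols: list,
                   low: int = 1, high: int = 5) -> pd.DataFrame:
    """Count missing values per column and out-of-range values per rating column."""
    report = pd.DataFrame({"n_missing": df.isna().sum()})
    report["n_out_of_range"] = 0
    for col in likert_cols:
        values = df[col].dropna()
        # Ratings outside [low, high] are treated as anomalies
        report.loc[col, "n_out_of_range"] = int((~values.between(low, high)).sum())
    return report


# Made-up rows: one impossible rating (7) and two missing entries
sample = pd.DataFrame({
    "Customer_Reviews_Importance": [1, 3, 7, 2],
    "Shopping_Satisfaction": [1, 2, None, 4],
    "Gender": ["Female", None, "Female", "Prefer not to say"],
})

report = quality_report(
    sample, ["Customer_Reviews_Importance", "Shopping_Satisfaction"]
)
print(report)
```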

#### Ethical handling of data
We are committed to addressing any biases, privacy concerns, and terms of use issues associated with our dataset:
- **Bias detection and mitigation:** We will employ statistical techniques and bias detection tools to identify and mitigate biases before, during, and after the analysis. This includes evaluating the demographic composition of our dataset and adjusting our analysis methods to account for any identified biases.
- **Privacy:** We have implemented robust measures to protect the privacy of survey participants. This includes anonymizing data and ensuring that any personally identifiable information is removed or obfuscated.
- **Equitability:** We're mindful of the potential impact of our findings on different populations. We will strive to communicate our results transparently, highlighting any limitations and ensuring that our conclusions are equitable and do not overgeneralize across different demographics.
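
One simple way to evaluate demographic composition is to compute each group's share of respondents. The sketch below uses the `age` and `Gender` values from the first five preview rows as stand-in data; the `composition` helper and the age bins are hypothetical choices for illustration, not values taken from the survey.

```python
import pandas as pd


def composition(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of rows in each group of `column`, counting missing values too."""
    return df[column].value_counts(normalize=True, dropna=False).round(3)


# Hypothetical age bins; pd.cut uses right-closed intervals by default
AGE_BINS = [0, 24, 34, 44, 120]
AGE_LABELS = ["<=24", "25-34", "35-44", "45+"]

# Stand-in rows: age and Gender from the first five rows of the preview
sample = pd.DataFrame({
    "age": [23, 23, 24, 24, 22],
    "Gender": ["Female", "Female", "Prefer not to say", "Female", "Female"],
})
sample["age_group"] = pd.cut(sample["age"], bins=AGE_BINS, labels=AGE_LABELS)

gender_share = composition(sample, "Gender")
age_share = composition(sample, "age_group")
print(gender_share)
print(age_share)
```

Groups with very small shares are candidates for a caveat in the write-up, since estimates for them rest on few respondents.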

#### Addressing ethical issues
Our group acknowledges that no dataset is entirely free from biases. However, we're committed to continuous improvement and ethical vigilance. Throughout our project, we will do the following as needed:
- **Monitor for ethical concerns:** Regularly review our methodology and results for any emerging ethical issues
- **Adjust methodologies:** Be prepared to adjust our analysis methods to address any identified biases or ethical concerns
- **Transparent communication:** Communicate our findings with full transparency, acknowledging the limitations and potential biases of our dataset


# Team Expectations 

* *Team Expectation 1* : All team members will be communicative when it comes to project work and meeting up.
* *Team Expectation 2* : All team members are expected to contribute to the workload of the project. If someone does not, a meeting will be held to discuss the issue. If that fails, a TA or other class staff will be involved.
* *Team Expectation 3* : We will be using both Discord and messages to communicate.

# Project Timeline Proposal

We will usually meet weekly on Wednesdays at 11 AM or 12 PM.

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/1  |  12 PM | Brainstorm topics | Discuss final project topic and hypothesis; Assign sections of the Project Proposal and review work | 
| 5/3  |  12 PM | Everyone's contribution to the proposal | Discuss final proposal and submit | 
| 5/8  |  11 AM |  Completed project proposal and reviewed all components | Search for data sets and decide on one; Begin data cleaning | 
| 5/15  | 11 AM  | Brainstorm ideas on how to wrangle, Review feedback on proposal  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part; Discuss how to fix Project Proposal feedback |
| 5/22  | 11 AM  | Import & Wrangle Data ; EDA | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 5/29  | 11 AM  | Finalize wrangling/EDA; Begin Analysis  | Discuss/edit Analysis|
| 6/5  | 12 PM  | Complete analysis; Draft results/conclusion/discussion | Discuss/edit full project |
| 6/12  | Before 11:59 PM  | NA | Review entire project; Turn in Final Project |