<a href="https://colab.research.google.com/github/alexhosp/startup-viability-analysis/blob/main/customer-segmentation/notebooks/01_google_trends_collection_cleaning__eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Google Trends: Data Collection, Cleaning & EDA
## Introduction
This notebook focuses on collecting data to identify and segment potential customers for a proposed AI-driven gardening robot startup. We will use the Pytrends API to gather data from Google Trends. Data will be cleaned and then stored in Google Cloud Storage (GCS).
## Steps
1. [Setup and GCS Configuration](#setup-gcs-configuration)
2. [Definition of Initial Keywords](#initial-keywords-definition)
3. [Data Collection](#data-collection)
    1. [API Authentication and Configuration](#api-authentication-configuration)
    2. [Defining Data Collection Functions](#defining-data-collection-functions)
    3. [Collect Relevant Data](#collect-relevant-data)
        1. [Collect Interest Over Time in Proposed Features](#collect-interest-features)
        2. [Collect Interest Over Time in Problems Addressed (Relevance)](#collect-interest-problems)
        3. [Collect Interest Over Time in Needs Addressed (Relevance)](#collect-interest-needs)
4. [Data Cleaning and Transformation](#clean-and-transform-data)
    1. [Clean and Transform Interest Over Time in Features Data](#clean-transform-features)
    2. [Clean and Transform Relevance of Problems Data](#clean-transform-problems)
    3. [Clean and Transform Relevance of Needs Data](#clean-transform-needs)
5. [Explore the Data](#explore-the-data)
    1. [Explore Interest Over Time in Proposed Features](#explore-features)
        1. [Identify Features with Highest Interest](#highest-interest-features)
        2. [Determine Weekly Average Interest per Keyword](#weekly-average-features)
        3. [Visualize Change in Interest Over Time](#change-in-interest-features)
        4. [Visualize Most Relevant Features](#relevant-features)
    2. [Explore Relevance of Problems](#explore-problems)
        1. [Identify Problems with Highest Relevance](#highest-interest-problems)
    3. [Explore Relevance of Needs](#explore-needs)
6. [Data Storage](#evaluate-and-select-relevant-data)
    1. [Store Data in GCS](#store-data-in-gcs)


<a name="setup-gcs-configuration"></a>
## Setup and GCS Configuration
* Install necessary libraries
* Authenticate and access GCS
* Set up a GCS bucket for storage

In [None]:
# Install necessary libraries
!pip install google-cloud-storage

In [None]:
# Import necessary libraries
from google.cloud import storage
import pandas as pd
from google.colab import auth

In [None]:
# Authenticate with GCP
auth.authenticate_user()

In [None]:
# Set up GCS client
project_id = 'idyllic-gear-422709-g4'
storage_client = storage.Client(project=project_id)

In [None]:
# Create a new bucket to store all data in

# Define the bucket name
bucket_name = 'startup-viability-analysis'

# Get a reference to the bucket (this does not contact GCS yet)
bucket = storage_client.bucket(bucket_name)

# Create the bucket if it does not exist
if not bucket.exists():
    bucket.storage_class = 'STANDARD'
    bucket = storage_client.create_bucket(bucket, location='us-east1')
    print(
        "Created bucket {} in {} with storage class {}".format(
            bucket.name, bucket.location, bucket.storage_class
        )
    )
else:
    print(f"Bucket {bucket_name} already exists")




In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

<a name="initial-keywords-definition"></a>
# Definition of Initial Keywords
The initial keywords are selected to gather data on interest in the prototype's features and on the relevance of the problems and needs it aims to address.

## Read in the keyword lists from GitHub
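
Raw text files often contain blank lines, stray whitespace, or repeated entries, and each of these would otherwise be sent to Google Trends as a keyword. The sketch below shows one way to clean the downloaded text before it is used. The helper `parse_keywords` and the sample keywords are illustrative assumptions and are not part of the original notebook.

```python
def parse_keywords(text):
    """Turn the raw text of a keyword file into a clean list of keywords."""
    keywords = []
    seen = set()
    for line in text.splitlines():
        keyword = line.strip()
        # Skip blank lines and repeated keywords (compared case-insensitively)
        if not keyword or keyword.lower() in seen:
            continue
        seen.add(keyword.lower())
        keywords.append(keyword)
    return keywords


# Hypothetical file content, used only to demonstrate the helper
sample_text = "robot lawn mower\n\n  automatic watering  \nRobot Lawn Mower\n"
print(parse_keywords(sample_text))
```

With the lists used in this notebook, the same helper could be applied to the `.text` of each response in place of a bare `splitlines()` call.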

In [None]:
# Install necessary libraries
!pip install requests

In [None]:
# Import necessary libraries
import requests

In [None]:
# Define file paths
base_url = 'https://raw.githubusercontent.com/alexhosp/startup-viability-analysis/main/customer-segmentation/data/raw/'
files = ['features.txt', 'problems.txt', 'needs.txt']

# Retrieve text files and convert them to Python lists
features = requests.get(base_url + files[0]).text.splitlines()
problems = requests.get(base_url + files[1]).text.splitlines()
needs = requests.get(base_url + files[2]).text.splitlines()
print("Number of features:", len(features))
print("Number of problems:", len(problems))
print("Number of needs:", len(needs))

### ⚠️ Google Trends Data Collection Issues ⚠️


Due to recent changes on the Google Trends site, collecting `interest_over_time` data with the `Pytrends` library is currently failing: every request returns a 429 (Too Many Requests) error.

I have attempted various solutions such as:
- Adding randomized sleep times between requests
- Mimicking browser headers
- Updating the Pytrends library

Despite these efforts, the issue persists. Therefore, I'm using data collected on July 1st 2024 (when this code still worked) for the purposes of this demonstration. I'm monitoring the situation and will update the code once a solution is found.

For more details and to track progress on this issue, please visit the [GitHub Issue](https://github.com/GeneralMills/pytrends/issues/625).

---

**⚠️ Important**: If you are viewing this notebook and would like to attempt to run the code, please be aware that it may not function as intended due to these external limitations.
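
For reference, the randomized sleep between requests listed above can be sketched as a small retry wrapper with exponential backoff and jitter. This is a minimal illustration only: `call_with_backoff` and its parameters are hypothetical names, the delays are arbitrary, and, as noted above, this approach did not resolve the 429 errors in this case.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=2.0, max_delay=60.0, sleep=time.sleep):
    """Call func() and retry with exponential backoff plus random jitter on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as error:  # in practice, narrow this to the rate-limit error
            if attempt == max_attempts:
                raise
            # The base delay doubles with every attempt, capped at max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add random jitter so that requests are not evenly spaced
            delay += random.uniform(0, delay / 2)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.1f}s")
            sleep(delay)


# Possible usage with the client defined later in this notebook:
# df = call_with_backoff(lambda: pytrends.interest_over_time())
```

The `sleep` argument is injectable so the wrapper can be exercised without actually waiting.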

<a name='data-collection'></a>
# Data Collection
Collect data from Google Trends using the defined keywords.

<a name='api-authentication-configuration'></a>
# Pytrends API Authentication & Configuration

In [None]:
# Install necessary libraries
!pip install pytrends
!pip install urllib3==1.25.11

In [None]:
# Import pytrends
from pytrends.request import TrendReq

In [None]:
# Define custom headers to mimic a real browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Referer': 'https://www.google.com'
}


In [None]:
# Initialize Pytrends
pytrends = TrendReq(hl='en-US', tz=360, retries=3, requests_args={'headers': headers})


In [None]:
# Import datetime
from datetime import datetime, timedelta

In [None]:
# Define timeframe for the analysis as 3 years
end_date = datetime.now()
start_date = end_date - timedelta(days=3*365)

# Format both dates as 'YYYY-MM-DD' and join them into a custom timeframe string
start_date_str = start_date.strftime('%Y-%m-%d')
end_date_str = end_date.strftime('%Y-%m-%d')
timeframe = f'{start_date_str} {end_date_str}'
timeframe

<a name='defining-data-collection-functions'></a>
# Defining Data Collection Functions
* Define functions to collect data from Google Trends
* Fetch interest over time in features, problems and needs


## Collect interest over time
* Find out about interest in proposed features
* Find out about interest in problems (relevance)
* Find out about interest in needs (relevance)

In [None]:
# Define the batch size
batch_size = 5


# Define a function to split lists into batches
def split_into_batches(lst, batch_size):
   return [lst[i:i+batch_size] for i in range(0, len(lst), batch_size)]


In [None]:
# Split the features list into batches of 5 keywords
features_chunks = split_into_batches(features, batch_size)
print("Number of chunks:", len(features_chunks))

In [None]:
# Split the problems list into batches of 5 keywords
problems_chunks = split_into_batches(problems, batch_size)
print("Number of chunks:", len(problems_chunks))

In [None]:
# Split the needs list into batches of 5 keywords
needs_chunks = split_into_batches(needs, batch_size)
print("Number of chunks:", len(needs_chunks))

In [None]:
import traceback

# Define a function to fetch interest over time for a list of keyword chunks
def fetch_interest_over_time(pytrends, chunks, timeframe='today 5-y'):
    """
    Fetch interest over time data from Google Trends for given chunks of keywords.

    Args:
    pytrends (TrendReq): An initialized Pytrends client.
    chunks (list of list of str): A list where each element is a list of keywords.
    timeframe (str): The timeframe for the Google Trends data (default: 'today 5-y').

    Returns:
    pd.DataFrame: A single DataFrame with one column per keyword, combining the interest
    over time data of all chunks. Empty if no chunk could be fetched.
    """
    dataframes = []

    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}/{len(chunks)}: {chunk}")
        try:
            pytrends.build_payload(kw_list=chunk, timeframe=timeframe)
            df = pytrends.interest_over_time()
            df = df.iloc[:-1]  # Remove the last row (most recent, partial week)
            if 'isPartial' in df.columns:
                df = df.drop(columns=['isPartial'])
            # Check dataframe shape
            print(f"DataFrame shape: {df.shape}")
            dataframes.append(df)
            print(f"Processed chunk {i+1}/{len(chunks)}")
        except Exception as e:
            print(f"Error processing chunk {i+1}/{len(chunks)}: {e}")
            # traceback.print_exc()
            continue
    # Check the length of the list of dataframes
    print(f"Number of dataframes: {len(dataframes)}")
    # Combine dataframes column-wise into one if the list is not empty
    if dataframes:
        dataframe = pd.concat(dataframes, axis=1)
        print(f"Combined dataframe shape: {dataframe.shape}")
        return dataframe
    # Return an empty DataFrame if no chunk was fetched successfully
    else:
        print("No dataframes to combine, returning an empty dataframe.")
        return pd.DataFrame()


<a name='collect-relevant-data'></a>
# Collect relevant data
* Use data collection functions to collect data from Google Trends

<a name='collect-interest-features'></a>
## Collect Interest Over Time in Proposed Features

In [None]:
# Get a dataframe with interest over time in all features
features_interest_over_time_df = fetch_interest_over_time(pytrends=pytrends, chunks=features_chunks, timeframe=timeframe)
features_interest_over_time_df.tail()

<a name='collect-interest-problems'></a>
## Collect Interest Over Time in Problems Addressed by the Prototype

In [None]:
# Get a dataframe with interest over time in all problems (relevance)
problems_interest_over_time = fetch_interest_over_time(pytrends=pytrends, chunks=problems_chunks, timeframe=timeframe)
problems_interest_over_time.tail()

<a name='collect-interest-needs'></a>
## Collect Interest Over Time in Needs Addressed by the Prototype

In [None]:
# Get a dataframe with interest over time in all needs (relevance of needs)
needs_interest_over_time = fetch_interest_over_time(pytrends=pytrends, chunks=needs_chunks, timeframe=timeframe)
needs_interest_over_time.tail(3)

<a name='clean-and-transform-data'></a>
# Clean and Transform Data
* Clean and transform all collected data for analysis
* Remove duplicate and irrelevant entries
* Sort data and create new relevant features


<a name='clean-transform-features'></a>
## Clean and Transform Interest Over Time in Proposed Features Data
* Ensure the data does not contain duplicate or irrelevant entries.
* Remove empty columns and store them in a dataframe for further analysis
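
The two cleaning steps above can be tried on a small example first. The sketch below uses made-up keywords and values, and the helper `split_zero_interest` is a hypothetical name. It relies on the fact that interest values are never negative, so a column is all zeros exactly when its sum is zero.

```python
import pandas as pd


def split_zero_interest(df):
    """Return (df without all-zero columns, list of all-zero column names)."""
    is_zero = (df == 0).all()
    zero_keywords = is_zero[is_zero].index.tolist()
    return df.loc[:, ~is_zero], zero_keywords


# Hypothetical weekly interest values for three made-up keywords
toy = pd.DataFrame(
    {"keyword a": [10, 0, 25], "keyword b": [0, 0, 0], "keyword c": [0, 5, 0]},
    index=pd.date_range("2024-01-07", periods=3, freq="W"),
)

kept, dropped = split_zero_interest(toy)
print(kept.columns.tolist())
print(dropped)

# Include the date index in the duplicate check so that two weeks with
# identical values are not mistaken for duplicated rows
print(toy.reset_index().duplicated().sum())
```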

In [None]:
features_interest_over_time_df.info()

In [None]:
# Check if there are any duplicated rows
duplicate_rows = features_interest_over_time_df.duplicated().sum()
duplicate_rows

In [None]:
# Check for empty columns
empty_columns = (features_interest_over_time_df == 0).all().sum()
empty_columns

### Identify and store keywords with no interest
* Either these keywords need to be improved, or the features are of minimal value to consumers
* Store features with zero interest in a separate DataFrame for further analysis

In [None]:
# Calculate the sum of values in each column
column_sums = features_interest_over_time_df.sum()

# Determine keywords with no interest
zero_interest_keywords = column_sums[column_sums == 0].index.tolist()
zero_interest_keywords

# Create a dataframe with zero interest keywords
zero_interest_features = pd.DataFrame(zero_interest_keywords, columns=['zero_interest_keywords'])
zero_interest_features.head()

### Remove Zero Interest Columns


In [None]:
# Identify and Remove Columns with Zero Sum
features_interest_over_time = features_interest_over_time_df.loc[:, features_interest_over_time_df.sum() != 0]
features_interest_over_time.sum().sort_values(ascending=False).tail()

<a name='clean-transform-problems'></a>
## Clean and Transform Relevance of Problems Data
* Ensure the data does not contain duplicate or irrelevant entries.
* Remove empty columns and store them in a dataframe for further analysis

In [None]:
problems_interest_over_time.info()

In [None]:
problems_interest_over_time.shape[0]

In [None]:
# Reset the index to include the date in the duplication check
temp_df = problems_interest_over_time.reset_index()
temp_df.head()

In [None]:
# Check for duplicates where both the date and all field values are duplicated
duplicate_rows = temp_df.duplicated().sum()
duplicate_rows

In [None]:
# Check for empty columns
empty_columns = (problems_interest_over_time == 0).all().sum()
empty_columns

### Identify and store keywords with no interest
* Either these keywords need to be improved, or the problems are of minimal interest to consumers
* Store problems with zero interest in a separate DataFrame for further analysis

In [None]:
# Sum values to determine relevance
relevance_problems_interest_over_time = problems_interest_over_time.sum()

# Identify problems with zero relevance
zero_relevance_problems = relevance_problems_interest_over_time[relevance_problems_interest_over_time == 0]

# Create a new dataframe containing zero interest keywords
zero_relevance_problems_df = pd.DataFrame(zero_relevance_problems.index, columns=['zero_interest_keywords'])
zero_relevance_problems_df.head()


### Remove Zero Interest Columns


In [None]:
# Remove zero relevance problems from dataframe
problems_interest_over_time = problems_interest_over_time.loc[:, problems_interest_over_time.sum() != 0]
problems_interest_over_time.sum().sort_values(ascending=False).tail()


<a name='clean-transform-needs'></a>
## Clean and Transform Relevance of Needs Data
* Ensure the data does not contain duplicate or irrelevant entries.
* Remove empty columns and store them in a dataframe for further analysis

In [None]:
needs_interest_over_time.info()

In [None]:
# Check for duplicate rows

# Create a temporary dataframe with the date index as a column
temp_df = needs_interest_over_time.reset_index()

# Check for duplicates where both the date and all field values are duplicated
duplicate_rows = temp_df.duplicated().sum()
duplicate_rows

In [None]:
# Check for columns where all values are zero
empty_columns = (needs_interest_over_time == 0).all().sum()
empty_columns

### Identify and store keywords with no interest
* Either these keywords need to be improved, or the needs are of minimal relevance to consumers
* Store needs with zero interest in a separate DataFrame for further analysis

In [None]:
# Sum values to determine relevance of each need
relevance_needs_interest_over_time = needs_interest_over_time.sum()
relevance_needs_interest_over_time.head()

In [None]:
# Identify needs with zero relevance and store them in a new dataframe
zero_relevance_needs = relevance_needs_interest_over_time[relevance_needs_interest_over_time == 0]
zero_relevance_needs = pd.DataFrame(zero_relevance_needs.index, columns=['zero_interest_keywords'])
zero_relevance_needs.head()

### Remove Zero Interest Columns


In [None]:
# Create a new series of needs by relevance without zero interest needs
needs_by_relevance = relevance_needs_interest_over_time[relevance_needs_interest_over_time != 0]
needs_by_relevance.tail()

In [None]:
# Create a new dataframe with zero relevance needs removed
needs_interest_over_time = needs_interest_over_time.loc[:, (needs_interest_over_time != 0).any(axis=0)]
needs_interest_over_time.sum().sort_values(ascending=False).tail()
needs_interest_over_time.head(3)

<a name='explore-the-data'></a>
# Explore the Data
* Conduct initial exploration to understand the data collected
* Generate summary statistics and visualizations  
* Identify patterns in data

<a name='explore-features'></a>
## Explore Interest Over Time in Proposed Features
* Identify features with the highest interest
* Calculate and visualize weekly average interest
* Generate summary statistics and visualizations

<a name='highest-interest-features'></a>
### Identify Features with Highest Interest
* Calculate the sum of interest values for each feature.
* Create a table sorted by relevance (highest interest).

In [None]:
# Calculate the sum of values in each column and sort in descending order
sorted_by_interest = features_interest_over_time.sum().sort_values(ascending=False)

# Format keywords into a table
features_by_interest = pd.DataFrame(data=sorted_by_interest, columns=['interest'])
features_by_interest.head()

<a name='weekly-average-features'></a>
### Determine Weekly Average Interest per Keyword
* Data has been collected over a period of 3 years or 156 weeks.
* Google Trends determines the popularity of a keyword per week on a scale from 0 to 100.
* Calculate the average weekly interest to get values in a range from 0 to 100 for all features with more than zero interest.
* Add the output as a new column `weekly_average_interest` in the `features_by_interest` DataFrame.
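
The calculation described above can be illustrated on a small example. The sketch below uses made-up keywords and values and, as an assumption that differs from the fixed divisor used in the next cell, divides by the number of weekly rows actually present in the data. With that choice the result is simply the column mean.

```python
import pandas as pd

# Hypothetical weekly interest values (0 to 100) for two made-up keywords
weekly = pd.DataFrame(
    {"keyword a": [40, 60, 80, 20], "keyword b": [0, 10, 0, 30]},
    index=pd.date_range("2024-01-07", periods=4, freq="W"),
)

totals = weekly.sum()
n_weeks = len(weekly)  # number of weekly rows actually present

summary = pd.DataFrame({"interest": totals})
summary["weekly_average_interest"] = (summary["interest"] / n_weeks).round(3)
print(summary)

# Dividing the sum by the row count is the same as taking the column mean
print(weekly.mean().round(3))
```

Deriving the divisor from the data keeps the average correct if a request returns slightly more or fewer weeks than expected.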

In [None]:
# Determine the average weekly interest to get values in a range from 0 to 100 for all features with more than zero interest
features_by_interest['weekly_average_interest'] = round(features_by_interest['interest'] / 156.42, 3)
features_by_interest.head()

<a name='change-in-interest-features'></a>
### Visualize How Interest in Proposed Features Changed Over Time

In [None]:
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Create dataframe with monthly time intervals

# Resample the data to monthly intervals and sum the values
features_interest_over_time_monthly = features_interest_over_time.resample('M').sum()

# Format the index to show month and year
features_interest_over_time_monthly.index = features_interest_over_time_monthly.index.strftime('%B %Y')
features_interest_over_time_monthly.head(3)

In [None]:
# Create a heatmap to understand how interest changed over time

# Set the figure size and resolution
plt.figure(figsize=(16, 16), dpi=300)

# Create the heatmap plot
sns.heatmap(
    data=features_interest_over_time_monthly.T,
    cmap='viridis', annot=False,
    cbar=True
    )

# Adjust size for x and y axis labels
plt.yticks(fontsize=14)
plt.xticks(fontsize=14);

# Add a title
plt.title('Interest in Proposed Features Over Time', fontsize=20);

# Add labels for the x and y axes
plt.xlabel('Month', fontsize=16)
plt.ylabel('Features', fontsize=16);

<a name='relevant-features'></a>
### Visualize most relevant features
* Prepare the dataframe for visualization
* Create a bar plot to visualize the overall interest in each feature

In [None]:
# Flatten the dataframe
features_by_interest_flat = features_by_interest.reset_index()
features_by_interest_flat.head(3)

In [None]:
# Visualize overall interest in proposed features

# Set the figure size and resolution for the bar plot
plt.figure(figsize=(16, 8), dpi=300)

# Create a bar plot to compare overall interest in all proposed features
sns.barplot(
    data=features_by_interest_flat,
    x='index',
    y='interest',
    hue='index',
    palette='Set2'
    )

# Rotate x-axis labels for better readability
plt.xticks(rotation=90, fontsize=12);

# Set plot title and axis labels
plt.title('Overall Interest in Proposed Features', fontsize=20)
plt.xlabel('Features', fontsize=16)
plt.ylabel('Interest', fontsize=16);

<a name='explore-problems'></a>
## Explore Relevance of Problems Over Time
* Relevance of problems is defined as interest over time
* Identify problems with highest relevance
* Visualize how relevance of problems changed over time
* Determine weekly average interest (weekly relevance of problems)


<a name='highest-interest-problems'></a>
### Identify Problems with Highest Relevance
* Calculate the sum of interest values for each problem
* Create a table sorted by relevance (highest interest)

In [None]:
# Calculate the sum of interest values for each problem and sort them by relevance
relevance_problems_interest_over_time = problems_interest_over_time.sum().sort_values(ascending=False)
relevance_problems_interest_over_time.head()

### Visualize How Relevance of Problems Changed Over Time
* Prepare the data for visualization
* Create a heatmap to show interest in keywords over time
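
The preparation step can be sketched on a small example with made-up keywords and values. As an assumption, the sketch groups by calendar month with `to_period("M")` instead of `resample`; this is a substitute technique that yields the same monthly sums and does not depend on the offset alias accepted by the installed pandas version.

```python
import pandas as pd

# Hypothetical weekly interest values: four weeks in January, two in February
weekly = pd.DataFrame(
    {"keyword a": [10, 20, 30, 40, 50, 60], "keyword b": [0, 0, 5, 5, 10, 10]},
    index=pd.date_range("2024-01-07", periods=6, freq="W"),
)

# Sum the weekly values within each calendar month
monthly = weekly.groupby(weekly.index.to_period("M")).sum()

# Use readable month labels for the heatmap axis
monthly.index = monthly.index.strftime("%B %Y")

# Transpose so that keywords become rows and months become columns
heatmap_input = monthly.T
print(heatmap_input)
```

The transposed frame has the shape that `sns.heatmap` receives in the cells below.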

In [None]:
# Resample dataframe to show interest in monthly intervals
problems_interest_over_time_monthly = problems_interest_over_time.resample('M').sum()
problems_interest_over_time_monthly.index = problems_interest_over_time_monthly.index.strftime('%B %Y')

problems_interest_over_time_monthly.head(3)

In [None]:
# Create a heatmap to understand how interest changed over time

# Set the figure size and resolution
plt.figure(figsize=(4, 4), dpi=150)

# Create a heatmap to visualize change in relevance of problems over time
sns.heatmap(
    data=problems_interest_over_time_monthly.T,
    cmap='viridis',
    annot=False,
    cbar=True
    )
plt.yticks(fontsize=9)
plt.xticks(fontsize=9);

### Visualize how interest in problems increased over time
Determine whether awareness of problems is increasing or decreasing.
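
Alongside the plots, the direction of change can be summarized as one number per keyword. The sketch below fits a straight line to each column with `numpy.polyfit` and reports the slope. The helper `trend_slopes`, the keywords, and the values are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def trend_slopes(df):
    """Fit a straight line to each column and return its slope per time step."""
    steps = np.arange(len(df))
    slopes = {
        col: np.polyfit(steps, df[col].to_numpy(dtype=float), 1)[0]
        for col in df.columns
    }
    return pd.Series(slopes).sort_values(ascending=False)


# Hypothetical yearly interest sums for two made-up problem keywords
yearly = pd.DataFrame(
    {"problem a": [100, 150, 200], "problem b": [90, 60, 30]},
    index=["2022", "2023", "2024"],
)

slopes = trend_slopes(yearly)
print(slopes)
```

A positive slope suggests growing interest and a negative slope suggests declining interest. Because the three-year window starts and ends mid-year, the first and last calendar years cover fewer weeks than the others, so yearly sums for those years should be read with care; fitting on monthly data reduces this effect.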


In [None]:
# Resample the dataframe to yearly intervals for less noise
problems_interest_over_time_yearly = problems_interest_over_time.resample('Y').sum()
problems_interest_over_time_yearly.index = problems_interest_over_time_yearly.index.strftime('%Y')
problems_interest_over_time_yearly.head()

# Remove columns with only zero values
problems_interest_over_time_yearly = problems_interest_over_time_yearly.loc[:, (problems_interest_over_time_yearly != 0).any()]

In [None]:
# Transform the dataframe to long format so that interest is represented as a column
problems_interest_over_time_flat = problems_interest_over_time_yearly.reset_index().melt(id_vars='date', var_name='problem', value_name='interest')
problems_interest_over_time_flat.head()

In [None]:
# Visualize interest over time in high relevance problems with more than zero interest
plt.figure(figsize=(16, 8), dpi=400)
sns.lineplot(data=problems_interest_over_time_flat, x='date', y='interest', hue='problem', palette='Set2');
# Rotate x-axis labels for better readability
plt.xticks(rotation=90, fontsize=12);

# Add title and place legend outside the plot
plt.title('Relevance of Problems over Time')
plt.legend(bbox_to_anchor=(1.05, 1), fontsize=14);

In [None]:
# Resample the dataframe to monthly intervals
problems_interest_over_time_monthly = problems_interest_over_time.resample('M').sum()
problems_interest_over_time_monthly.index = problems_interest_over_time_monthly.index.strftime('%Y-%m')
problems_interest_over_time_monthly.head()

In [None]:
# Remove columns with only zero values
problems_interest_over_time_monthly = problems_interest_over_time_monthly.loc[:, (problems_interest_over_time_monthly != 0).any()]
problems_interest_over_time_monthly.head()

In [None]:
# Transform the dataframe to long format
problems_interest_over_time_monthly_flat = problems_interest_over_time_monthly.reset_index().melt(id_vars='date', var_name='problem', value_name='interest')
problems_interest_over_time_monthly_flat.head()

In [None]:
# Plot the same data at monthly intervals for more granularity
plt.figure(figsize=(16, 8), dpi=400)
sns.lineplot(data=problems_interest_over_time_monthly_flat, x='date', y='interest', hue='problem', palette='husl');
# Rotate x-axis labels for better readability
plt.xticks(rotation=90, fontsize=12);

# Add title and place legend outside the plot
plt.title('Relevance of Problems over Time')
plt.legend(bbox_to_anchor=(1.05, 1), fontsize=14);


In [None]:
# Get weekly average interest for the problems
# (problems_by_relevance is built from the sorted sums calculated above)
problems_by_relevance = relevance_problems_interest_over_time.to_frame(name='interest')
problems_by_relevance['weekly_average_interest'] = round(problems_by_relevance['interest'] / 156.42, 3)
problems_by_relevance.head()

In [None]:
# Rename columns
problems_by_relevance.rename(columns={'weekly_average_interest': 'problems_weekly_average_interest'}, inplace=True)
features_by_interest.rename(columns={'weekly_average_interest': 'features_weekly_average_interest'}, inplace=True)

# Create a new dataframe with weekly average interest in features and problems as columns
features_problems_interest = pd.concat([problems_by_relevance['problems_weekly_average_interest'], features_by_interest['features_weekly_average_interest']], axis=1)
features_problems_interest.head()

<a name='explore-needs'></a>
## Explore Relevance (Interest Over Time) of Needs Addressed

### Identify Needs with Highest Relevance
* Calculate the sum of interest values for each need
* Create a table sorted by relevance (highest interest)

In [None]:
# Sort needs by relevance
relevance_needs_interest_over_time = needs_interest_over_time.sum().sort_values(ascending=False)
relevance_needs_interest_over_time.head()

In [None]:
# Resample the dataframe to show interest in monthly intervals
needs_interest_over_time_monthly = needs_interest_over_time.resample('M').sum()
needs_interest_over_time_monthly.index = needs_interest_over_time_monthly.index.strftime('%B %Y')
needs_interest_over_time_monthly.tail()

In [None]:
# Create a heatmap to understand how interest changed over time
plt.figure(figsize=(4, 4), dpi=150)
sns.heatmap(data=needs_interest_over_time_monthly.T, cmap='viridis', annot=False, cbar=True);
plt.yticks(fontsize=9)
plt.xticks(fontsize=9);

In [None]:
# Visualize the overall relevance of needs over the last 3 years
plt.figure(figsize=(16, 8), dpi=400)
sns.lineplot(data=needs_interest_over_time_monthly, palette='Set2');
plt.xticks(rotation=90, fontsize=12);

### Visualize overall relevance of needs in comparison to one another
Determine which needs were the most relevant over the last three years.

In [None]:
# Flatten the series into a dataframe with 'index' and 'interest' columns
needs_by_relevance_flat = needs_by_relevance.reset_index(name='interest')
needs_by_relevance_flat.head()

In [None]:
# Visualize relevance of needs in comparison using a bar chart
plt.figure(figsize=(16, 8), dpi=400)
sns.barplot(data=needs_by_relevance_flat, x='index', y='interest', hue='index', palette='Set2');
plt.xticks(rotation=90, fontsize=12);

## Visualize how interest in needs differs from interest in features and interest in problems
* Which needs, problems, or features have the highest overall interest?
* Compare the weekly average of all keywords searched
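
When the three tables are stacked into one, it helps to record which list each keyword came from, so the combined chart can be grouped or colored by category. The sketch below shows one way to do this with `pd.concat` and a dictionary of frames. The keywords, the values, and the `category` column are illustrative assumptions.

```python
import pandas as pd

# Hypothetical weekly averages for made-up keywords from each list
features = pd.DataFrame({"weekly_average_interest": [12.5, 3.0]}, index=["feature a", "feature b"])
problems = pd.DataFrame({"weekly_average_interest": [40.0]}, index=["problem a"])
needs = pd.DataFrame({"weekly_average_interest": [7.25]}, index=["need a"])

# The dictionary keys become a 'category' index level, which reset_index turns into a column
all_keywords = (
    pd.concat(
        {"feature": features, "problem": problems, "need": needs},
        names=["category", "keyword"],
    )
    .reset_index()
    .sort_values("weekly_average_interest", ascending=False, ignore_index=True)
)
print(all_keywords)
```

Passing `hue='category'` to `sns.barplot` would then color the bars by list rather than by individual keyword.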


In [None]:
# Get weekly average interest in features and create a new dataframe
features_by_interest['weekly_average_interest'] = round(features_by_interest['interest'] / 156.42, 3)
features_by_interest.head()

In [None]:
# Get weekly average interest for problems and create a new dataframe
problems_by_relevance = pd.DataFrame(problems_by_relevance, columns=['interest'])
problems_by_relevance.head()
problems_by_relevance['weekly_average_interest'] = round(problems_by_relevance['interest'] / 156.42, 3)
problems_by_relevance.head()

In [None]:
# Get weekly average interest in needs and create a new dataframe
needs_by_relevance = pd.DataFrame(needs_by_relevance, columns=['interest'])
needs_by_relevance['weekly_average_interest'] = round(needs_by_relevance['interest'] / 156.42, 3)
needs_by_relevance.head()

In [None]:
# Create a dataframe with weekly average interest in all keywords

# Reset the index and create a keyword column for all dataframes
features_by_interest_flat = features_by_interest.reset_index()
problems_by_relevance_flat = problems_by_relevance.reset_index()
needs_by_relevance_flat = needs_by_relevance.reset_index()

# Concatenate as rows to a new dataframe
all_keywords = pd.concat([features_by_interest_flat, problems_by_relevance_flat, needs_by_relevance_flat], axis=0)
all_keywords.rename(columns={'index': 'keyword'}, inplace=True)
all_keywords.head()

# Sort the dataframe by weekly average interest
all_keywords = all_keywords.sort_values(by='weekly_average_interest', ascending=False)
all_keywords.tail()


In [None]:
# Visualize weekly average interest in all keywords
plt.figure(figsize=(16, 8), dpi=400)
sns.barplot(data=all_keywords, x='keyword', y='weekly_average_interest', hue='keyword', palette='Set2');
plt.xticks(rotation=90, fontsize=12);

<a name='evaluate-and-select-relevant-data'></a>
# Evaluate and Select Relevant Data
* Evaluate the explored data and decide which parts are valuable for further analysis



## Evaluate Relevant Data on Interest in Features
* **Weekly interest in features**: Valuable as this data provides easy-to-understand metrics (0 - 100) and offers insights into which prototype features are most relevant out of the ones that received interest.
* **Interest in features over time**: Relevant for analyzing how interest in features changes over time. Includes features with zero interest.
* **Zero interest features**: Need further evaluation to understand why these keywords received no interest.

## Storage Plan on Interest in Features
1. **Features by Average Weekly Interest**:
  * Dataframe: `features_by_interest`
  * Description: Contains the average weekly interest for each relevant feature keyword
  * Reason: Provides a metric for comparing the relevance of features to customers
  * File name: `features_by_interest.parquet`
2. **Interest in Features Over Time**:
  * Dataframe: `features_interest_over_time`
  * Description: Contains weekly interest in features over a period of 3 years. Contains features with zero interest.
  * Reason: Helps to understand trends and changes in interest over time.
  * File name: `features_interest_over_time.parquet`
3. **Features with Zero Interest**:
  * Dataframe: `zero_interest_features`
  * Description: Contains keywords from the feature list that returned zero interest.
  * Reason: To determine whether the lack of interest is due to the keywords used or the features themselves.
  * File name: `zero_interest_features.parquet`

## Evaluate Relevant Data on Interest in Problems (relevance of problems)
* **Weekly interest in problems**: Valuable as this data provides easy-to-understand metrics (0 - 100) and shows which of the problems the prototype aims to address are most relevant among those that received interest.
* **Interest in problems over time**: Relevant for analyzing how the relevance of problems changes over time. Includes problems with zero interest.
* **Zero interest problems**: Need further evaluation to understand why these keywords received no interest.

## Storage Plan on Interest in Problems
1. **Problems by Average Weekly Interest**:
  * Dataframe: `problems_by_relevance`
  * Description: Contains the average weekly interest for each relevant problem keyword
  * Reason: Provides a metric for comparing the relevance of problems to customers
  * File name: `problems_by_relevance.parquet`
2. **Relevance of Problems Over Time**:
  * Dataframe: `problems_relevance_over_time`
  * Description: Contains weekly interest in problems over a period of 3 years. Includes problems with zero interest.
  * Reason: Helps to understand trends and changes in interest over time.
  * File name: `problems_relevance_over_time.parquet`
3. **Problems with Zero Relevance**:
  * Dataframe: `zero_relevance_problems`
  * Description: Contains keywords from the problems list that returned zero interest.
  * Reason: To determine whether the lack of interest is due to the keywords used or the problems themselves.
  * File name: `zero_relevance_problems.parquet`

## Evaluate Relevant Data on Interest in Needs (relevance of needs)
* **Weekly interest in Needs**: Valuable because it provides an easy-to-understand metric (0 - 100) and shows which of the needs the prototype aims to address are most relevant, among those that received interest.
* **Interest in Needs Over Time**: Useful for analyzing how the relevance of needs changes over time. Includes needs with zero interest.
* **Zero Interest Needs**: Require further evaluation to understand why these keywords received no interest.

## Storage Plan on Interest in Needs
1. **Needs by Average Weekly Interest**:
  * Dataframe: `needs_by_relevance`
  * Description: Contains the average weekly interest for each relevant need keyword
  * Reason: Provides a metric for comparing the relevance of needs to customers
  * File name: `needs_by_relevance.parquet`
2. **Relevance of Needs Over Time**:
  * Dataframe: `needs_interest_over_time_all`
  * Description: Contains weekly relevance of needs over a period of 3 years. Includes needs with zero interest.
  * Reason: Helps in understanding trends and changes in interest over time.
  * File name: `needs_interest_over_time.parquet`
3. **Needs with Zero Relevance**:
  * Dataframe: `zero_relevance_needs`
  * Description: Contains keywords in the needs list that returned zero interest.
  * Reason: To determine whether the lack of interest is due to the keywords used or the needs themselves.
  * File name: `zero_relevance_needs.parquet`
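
The three storage plans can also be written down as data, which makes it easier to spot mismatches between dataframe names and file names (for example, `needs_interest_over_time_all` is stored as `needs_interest_over_time.parquet`). The sketch below is one possible way to do this. The names `STORAGE_PLAN`, `SUBDIRECTORY`, and `blob_paths` are hypothetical and do not appear elsewhere in the notebook; the `google_trends` subdirectory is the one used in the upload cells.

```python
# Storage plan as data: dataframe variable name -> Parquet file name
STORAGE_PLAN = {
    "features_by_interest": "features_by_interest.parquet",
    "features_interest_over_time": "features_interest_over_time.parquet",
    "zero_interest_features": "zero_interest_features.parquet",
    "problems_by_relevance": "problems_by_relevance.parquet",
    "problems_relevance_over_time": "problems_relevance_over_time.parquet",
    "zero_relevance_problems": "zero_relevance_problems.parquet",
    "needs_by_relevance": "needs_by_relevance.parquet",
    "needs_interest_over_time_all": "needs_interest_over_time.parquet",
    "zero_relevance_needs": "zero_relevance_needs.parquet",
}

SUBDIRECTORY = "google_trends"


def blob_paths(plan, subdirectory):
    """Validate the plan and return dataframe name -> blob path within the bucket."""
    file_names = list(plan.values())

    # Two dataframes writing to the same file would silently overwrite each other
    duplicates = {name for name in file_names if file_names.count(name) > 1}
    if duplicates:
        raise ValueError(f"Duplicate file names in plan: {sorted(duplicates)}")

    # Every file in the plan is expected to be a Parquet file
    not_parquet = [name for name in file_names if not name.endswith(".parquet")]
    if not_parquet:
        raise ValueError(f"Non-Parquet file names in plan: {not_parquet}")

    return {
        df_name: f"{subdirectory}/{file_name}"
        for df_name, file_name in plan.items()
    }


paths = blob_paths(STORAGE_PLAN, SUBDIRECTORY)
for df_name, path in paths.items():
    print(f"{df_name} -> {path}")
```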

<a name='store-data-in-gcs'></a>
# Store Data in GCS
* Store cleaned and transformed data in Google Cloud Storage
* Save data in a suitable format
* Upload the data to a designated GCS storage bucket

In [None]:
# Define a function to upload dataframes as parquet to GCS
def upload_dataframe_to_gcs(dataframe, subdirectory, blob_name, project_id=project_id):
    """
    Saves a DataFrame as a Parquet file and uploads it to a specified GCS bucket and subdirectory.

    Parameters:
    dataframe (pd.DataFrame): The DataFrame to be saved and uploaded.
    subdirectory (str): The subdirectory within the GCS bucket where the file will be stored.
    blob_name (str): The name of the blob (file) in GCS.
    project_id (str): The Google Cloud project ID (not used inside the function body).

    Note: the target bucket is not a parameter. The function relies on the
    notebook-level `bucket` and `bucket_name` variables.

    Returns:
    None
    """
    # Save the DataFrame as a Parquet file locally in the /content directory
    temp_path = f'/content/{blob_name}'
    dataframe.to_parquet(temp_path)

    # Verify the file was saved locally
    !ls -lh {temp_path}

    # Upload the Parquet file to GCS in the specified subdirectory
    blob = bucket.blob(f'{subdirectory}/{blob_name}')
    blob.upload_from_filename(temp_path)

    print(f"File {subdirectory}/{blob_name} uploaded to {bucket_name}.")
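
The function above depends on the notebook-level `bucket` and `bucket_name` variables and on the IPython `!ls` shell escape, so it only runs inside a notebook session where those are available. As a possible alternative, the sketch below passes the bucket in explicitly and checks the local file with `os.path.getsize`. The name `upload_dataframe_to_bucket` and its return value are assumptions for illustration. It assumes `bucket` behaves like a `google.cloud.storage.Bucket`, i.e. it offers `blob(name)` returning an object with `upload_from_filename(path)`.

```python
import os
import tempfile


def upload_dataframe_to_bucket(dataframe, bucket, subdirectory, blob_name):
    """
    Save a DataFrame as Parquet in a temporary directory and upload it.

    Returns the blob path and the size of the local file in bytes.
    """
    blob_path = f"{subdirectory}/{blob_name}"

    # The temporary directory is removed automatically after the upload
    with tempfile.TemporaryDirectory() as tmp_dir:
        temp_path = os.path.join(tmp_dir, blob_name)
        dataframe.to_parquet(temp_path)

        # Verify the file was written locally before uploading
        size_bytes = os.path.getsize(temp_path)

        blob = bucket.blob(blob_path)
        blob.upload_from_filename(temp_path)

    print(f"File {blob_path} uploaded ({size_bytes} bytes).")
    return blob_path, size_bytes
```

In this notebook it could be called as `upload_dataframe_to_bucket(features_by_interest, bucket, subdirectory, 'features_by_interest.parquet')`, assuming `bucket` was created in the setup section. Passing the bucket as an argument also allows the function to be exercised with a stand-in object, without touching GCS.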


## Store Data on Interest in Features

### Store features by average weekly interest

In [None]:
# Define subdirectory for all Google Trends data
subdirectory = 'google_trends'

In [None]:
# Upload the dataframe as parquet to GCS
upload_dataframe_to_gcs(dataframe=features_by_interest, subdirectory=subdirectory, blob_name='features_by_interest.parquet')

In [None]:
features_by_interest.head(3)

### Store weekly interest over time in features

In [None]:
features_interest_over_time = features_interest_over_time_df
# Upload the dataframe as parquet to GCS
upload_dataframe_to_gcs(dataframe=features_interest_over_time, subdirectory=subdirectory, blob_name='features_interest_over_time.parquet')

In [None]:
features_interest_over_time.head(3)

### Store features with no interest

In [None]:
# Upload the dataframe as parquet to GCS
upload_dataframe_to_gcs(dataframe=zero_interest_features, subdirectory=subdirectory, blob_name='zero_interest_features.parquet')

In [None]:
zero_interest_features.head(3)

## Store Data on Relevance of Problems

### Store Problems by Average Weekly Interest

In [None]:
# Upload the dataframe as parquet to GCS
upload_dataframe_to_gcs(dataframe=problems_by_relevance, subdirectory=subdirectory, blob_name='problems_by_relevance.parquet')

In [None]:
problems_by_relevance.head(3)

### Store weekly relevance over three years in problems

In [None]:
problems_relevance_over_time = problems_interest_over_time
upload_dataframe_to_gcs(dataframe=problems_relevance_over_time, subdirectory=subdirectory, blob_name='problems_relevance_over_time.parquet')

### Store Problems with No Relevance  

In [None]:
zero_relevance_problems = zero_relevance_problems_df
upload_dataframe_to_gcs(dataframe=zero_relevance_problems, subdirectory=subdirectory, blob_name='zero_relevance_problems.parquet')

In [None]:
zero_relevance_problems.head(3)

## Store Data on Relevance of Needs

### Store Needs by Average Weekly Interest

In [None]:
# Upload the dataframe as parquet to GCS
upload_dataframe_to_gcs(dataframe=needs_by_relevance, subdirectory=subdirectory, blob_name='needs_by_relevance.parquet')

In [None]:
needs_by_relevance.head(3)

### Store weekly relevance data for needs over the past three years

In [None]:
# Upload the dataframe as parquet to GCS
upload_dataframe_to_gcs(dataframe=needs_interest_over_time_all, subdirectory=subdirectory, blob_name='needs_interest_over_time.parquet')

In [None]:
needs_interest_over_time_all.head(3)

### Store Needs with No Relevance


In [None]:
# Upload the dataframe as parquet to GCS
upload_dataframe_to_gcs(dataframe=zero_relevance_needs, subdirectory=subdirectory, blob_name='zero_relevance_needs.parquet')

In [None]:
zero_relevance_needs.head()