<a href="https://colab.research.google.com/github/alexhosp/startup-viability-analysis/blob/main/customer-segmentation/notebooks/02_google_trends_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Google Trends: Data Analysis
## Introduction
This notebook analyzes Google Search trends for keywords related to our proposed prototype features as well as the problems and user needs addressed by the prototype. By identifying the most relevant keywords and observing how interest in these keywords has changed over the last three years, we can gather insights that will guide future customer segmentation analysis.

## Contents
1. [Data Overview](#data-overview)
  1. [Loading Data from GCS](#loading-data)
  2. [Description of Loaded Data](#description-of-available-data)
  3. [Brief Evaluation of Keywords with No Interest](#data-no-interest)
2. [Key Features Analysis](#key-features-analysis)
  1. [Definition of Most Relevant Features](#most-relevant-features)
  2. [Detailed Trend Analysis of Key Features](#key-features-trend-analysis)
    1. [Moderate Interest Features](#trends-moderate-features)
    2. [High Interest Features](#trends-high-features)
    3. [Critical Interest Features](#trends-critical-features)
  3. [Visualizations: Interest Over Time for Key Features](#features-visualizations)
    1. [Moderate Interest Features](#features-moderate-visualizations)
    2. [High Interest Features](#features-high-visualizations)
    3. [Critical Interest Features](#features-critical-visualizations)
3. [Key Problems Analysis](#key-problems-analysis)
  1. [Definition of Most Relevant Problems](#most-relevant-problems)
  2. [Detailed Trend Analysis of Key Problems](#key-problems-trend-analysis)
    1. [Critical Interest Problems](#trends-critical-problems)
  3. [Visualizations: Interest Over Time for Key Problems](#problems-visualizations)
    1. [Critical Interest Problems](#problems-critical-visualizations)
4. [Key Needs Analysis](#key-needs-analysis)
  1. [Definition of Most Relevant Needs](#most-relevant-needs)
  2. [Detailed Trend Analysis of Key Needs](#key-needs-trend-analysis)
    1. [Moderate Interest Needs](#trends-moderate-needs)
    2. [High Interest Needs](#trends-high-needs)
    3. [Critical Interest Needs](#trends-critical-needs)
  3. [Visualizations: Interest Over Time for Key Needs](#needs-visualizations)
    1. [Moderate Interest Needs](#visualizations-moderate-needs)
    2. [High Interest Needs](#visualizations-high-needs)
    3. [Critical Interest Needs](#visualizations-critical-needs)
5. [Contextual Exploration of Most Relevant Keywords](#contextual-exploration)
  1. [Summary Table: Most Relevant Keywords](#most-relevant-keywords)
  2. [Exploration of Interest Context](#categories)
6. [Presentation of Results and Insights](#presentation-of-results)
  1. [Keyword Performance Results](#keyword-performance)
  2. [Interest Trend Analysis Results](#interest-trends)
  3. [Contextual Exploration Results](#contextual-exoloration-results)
  4. Visualization of Key Insights
7. Conclusion
  1. Summary of Key Insights
  2. Relevance for Further Analysis
  3. Relevance for Market Understanding
  4. Relevance for Prototype Development
8. Next Steps
  1. Future Analysis Plan

<a name='data-overview'></a>
## Data Overview
* Load data from GCS.
* Describe structure, format and contents of data.
* Evaluate keywords that did not receive interest and will not be included in the analysis.


<a name='loading-data'></a>
## Loading Data from GCS
* Import necessary libraries
* Authenticate and access GCS
* Load data from storage bucket

In [None]:
# Import necessary libraries
from google.cloud import storage
import pandas as pd
from google.colab import auth

In [None]:
# Authenticate with GCP
auth.authenticate_user()


In [None]:
# Set up GCS client
project_id = 'idyllic-gear-422709-g4'
storage_client = storage.Client(project=project_id)

In [None]:
# Define storage location
bucket_name = 'startup-viability-analysis'
bucket = storage_client.get_bucket(bucket_name)
bucket

In [None]:
# Define sub-folder name
base_path = 'google_trends/'

In [None]:
# Define function to load parquet files from GCS
def load_parquet(bucket, file_path):
  """
  Load a Parquet file from a Google Cloud Storage bucket into a pandas DataFrame.

  Args:
      bucket (google.cloud.storage.bucket.Bucket): The Google Cloud Storage bucket object.
      file_path (str): The path to the Parquet file within the GCS bucket.

  Returns:
      pd.DataFrame: The loaded data as a pandas DataFrame.

  Example:
      df = load_parquet(bucket, 'path/to/your/file.parquet')
  """
  # Create a Blob object for the specified file path within the bucket
  blob = bucket.blob(file_path)

  # Open the blob as a file object and read it into a pandas DataFrame
  with blob.open("rb") as f:
      df = pd.read_parquet(f)
  return df

### Load Data for Keywords with No Relevance

In [None]:
# Load dataframe with features with no interest
zero_relevance_features = load_parquet(bucket, base_path + 'zero_interest_features.parquet')
zero_relevance_features.head(2)

In [None]:
# Load dataframe with problems with no interest (relevance)
zero_relevance_problems = load_parquet(bucket, base_path + 'zero_relevance_problems.parquet')
zero_relevance_problems.head(2)

In [None]:
# Load dataframe with needs with no interest (relevance)
zero_relevance_needs = load_parquet(bucket, base_path + 'zero_relevance_needs.parquet')
zero_relevance_needs.head(2)

### Load Data for Keywords Sorted by Relevance

In [None]:
# Load features by interest dataframe
features_by_interest = load_parquet(bucket, base_path + 'features_by_interest.parquet')
features_by_interest.head(2)

In [None]:
# Load problems by relevance dataframe
problems_by_relevance = load_parquet(bucket, base_path + 'problems_by_relevance.parquet')
problems_by_relevance.head(2)

In [None]:
# Load needs by relevance dataframe
needs_by_relevance = load_parquet(bucket, base_path + 'needs_by_relevance.parquet')
needs_by_relevance.head(2)

### Load Data For Keywords Interest Over Time

In [None]:
# Load data for features interest over time
features_interest_over_time = load_parquet(bucket, base_path + 'features_interest_over_time.parquet')
features_interest_over_time.tail(2)

In [None]:
# Load data for problems relevance over time
problems_relevance_over_time = load_parquet(bucket, base_path + 'problems_relevance_over_time.parquet')
problems_relevance_over_time.tail(2)

In [None]:
# Load data for needs relevance over time
needs_relevance_over_time = load_parquet(bucket, base_path + 'needs_interest_over_time.parquet')
needs_relevance_over_time.tail(2)

<a name='description-of-available-data'></a>
## Description of Loaded Data
* Describe metrics used by Google Trends
* Describe the format of the data
* Describe the structure of the data
* Describe the contents of the data

#### Understanding Google Trends Metrics:
Google Trends data is sampled and indexed on a scale from 0 to 100, with 100 representing the highest search interest for the selected time and location. The data is normalized to reflect the share of searches for a specific term relative to the total number of searches at that time and location, rather than the absolute number of searches. This provides a relative measure of interest, indicating how popular a search term is compared to overall search volume in the given context. I evaluated searches worldwide over a period of 3 years.

[More information](https://newsinitiative.withgoogle.com/resources/trainings/google-trends/basics-of-google-trends/#:~:text=Understanding%20the%20numbers,-By%20now%2C%20we&text=Indexing%3A%20Google%20Trends%20data%20is,the%20time%20and%20location%20selected.)
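
The indexing described above can be sketched with made-up numbers. The counts below are hypothetical (Google Trends does not expose raw search volumes), and the helper name `to_trends_index` is an assumption used only for illustration: each period's share of total searches is divided by the largest share and scaled so that the peak equals 100.

```python
import pandas as pd

# Hypothetical raw counts; Google Trends does not publish these numbers.
raw = pd.DataFrame(
    {
        "keyword_searches": [120, 300, 150, 60],
        "total_searches": [10_000, 12_000, 15_000, 12_000],
    },
    index=pd.to_datetime(["2024-01-07", "2024-01-14", "2024-01-21", "2024-01-28"]),
)


def to_trends_index(keyword_searches, total_searches):
    """Scale the keyword's share of all searches so that the peak share equals 100."""
    share = keyword_searches / total_searches
    return (share / share.max() * 100).round().astype(int)


raw["interest"] = to_trends_index(raw["keyword_searches"], raw["total_searches"])
print(raw)
```

In this toy table, the third week has more keyword searches than the first week but a lower index value, because total search volume grew faster than searches for the keyword.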


### Data Format
All data is stored in the `google_trends` subfolder in Google Cloud Storage (GCS) as Parquet files. These files were loaded into the notebook using a function that reads them into pandas DataFrames.



### Data Structure and Contents
The dataset comprises three different data structures: keywords with no relevance, keywords sorted by overall relevance, and keywords sorted by time. Each structure is consistently formatted across features, problems, and needs.


#### Structure 1: Keywords with No Relevance
##### Description:
* The dataframes `zero_relevance_features`, `zero_relevance_problems`, and `zero_relevance_needs` share this structure.
* These dataframes contain keywords that received no significant search interest over the last three years.

##### Index:
The dataframe uses a sequential numerical index.
##### Columns:

- `zero_interest_keywords`: Contains the keyword as searched with the Google Trends API. The datatype is a string.

##### Example Data:

In [None]:
# Example data for feature keywords with no relevance
zero_relevance_features.head(3)

In [None]:
# Example data for problems keywords with no relevance
zero_relevance_problems.head(3)

In [None]:
# Example data for needs keywords with no relevance
zero_relevance_needs.head(3)

#### Structure 2: Keywords by Relevance
##### Description:
* The dataframes `features_by_interest`, `problems_by_relevance`, and `needs_by_relevance` share this structure.
* These dataframes contain keywords with substantial interest, sorted in descending order by aggregated interest over the last 3 years.
* The dataframe contains no null-values.

##### Index:
The dataframe uses the keywords as the index.
##### Columns:

- `interest`: Sum of the weekly interest values collected over three years. The datatype is int64.
- `weekly_average_interest`: Average weekly interest over the 3-year period, on the same 0 to 100 scale used by Google Trends. This column provides a measure of relative interest similar to the one used by Google Trends, but it represents the average over the entire time period. Rounded to 3 decimals. The datatype is float64.
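
How the two columns relate to weekly interest data can be sketched as follows. This is an assumption about how the stored tables were derived, not the actual preprocessing code (which is not part of this notebook), and the toy keywords and values are made up.

```python
import pandas as pd

# Toy weekly interest table with dates as index and one column per keyword
weekly = pd.DataFrame(
    {"Keyword A": [10, 20, 30, 40], "Keyword B": [0, 5, 5, 2]},
    index=pd.date_range("2024-01-07", periods=4, freq="7D", name="date"),
)

# Sum and mean per keyword, sorted by total interest in descending order
by_interest = pd.DataFrame(
    {
        "interest": weekly.sum(),
        "weekly_average_interest": weekly.mean().round(3),
    }
).sort_values("interest", ascending=False)

print(by_interest)
```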

##### Example Data:

In [None]:
# Example data for features sorted by relevance
features_by_interest.head(3)

In [None]:
# Example data for problems sorted by relevance
problems_by_relevance.head(3)

In [None]:
# Example data for needs sorted by relevance
needs_by_relevance.head(3)

#### Structure 3: Keywords by Weekly Interest
##### Description:
* The dataframes `features_interest_over_time`, `problems_relevance_over_time` and `needs_relevance_over_time` share this structure.
* These dataframes contain the weekly interest for each keyword over a period of 3 years, with interest aggregated in weekly intervals by Google Trends.
* The dataframes contain no empty columns.

##### Index:
The dataframe uses the date as an index, representing weekly intervals.
##### Columns:
Each column name represents a keyword for which interest is collected weekly over a period of 3 years.

##### Example Data:

In [None]:
# Example data for interest over time in features
features_interest_over_time.head(3)

In [None]:
# Example data for relevance over time of problems
problems_relevance_over_time.head(3)

In [None]:
# Example data for relevance over time of needs
needs_relevance_over_time.tail(3)

<a name='data-no-interest'></a>
## Brief Evaluation of Keywords with No Relevance
- Explore the dataframes containing keywords that received no significant search interest over the last three years.
- Analyze a sample of keywords to determine if the lack of interest is due to the concepts they represent or if the keywords were not chosen well.
- Assess the overall number of keywords that received no interest.


### Assessing Keywords Representing Proposed Features
- Use Google Trends to check if there are related queries that are more popular

In [None]:
zero_relevance_features

Observations:
* All keywords in the list consist of three or more words.
* Many keywords use complex technical jargon (e.g., 'hexapod mobility and agility', 'beneficial ecosystem modelling').
* Some keywords are highly specific or niche (e.g., 'adaptive algorithms for environmental conditions').

#### Identify Alternative Keywords with Similar Meanings
* Use `pytrends.suggestions` to find related search suggestions that could phrase ideas more simply and generally.
* Identify keywords with similar meaning that yield higher search volumes.

**Initialize the Pytrends API**

In [None]:
# Install and import necessary libraries
!pip install pytrends
!pip install urllib3==1.25.11
from pytrends.request import TrendReq

In [None]:
# Initialize the Pytrends object
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Referer': 'https://www.google.com'
}

pytrends = TrendReq(hl='en-US', tz=360, retries=3, requests_args={'headers': headers})

# Define the timeframe (same as all data previously collected)
timeframe = '2021-06-21 2024-06-23'

**Create a list of keywords with zero interest**

In [None]:
# Show number of features keywords with zero interest
keywords = zero_relevance_features['zero_interest_keywords'].tolist()
len(keywords)

In [None]:
# Initialize an empty list to store the DataFrames
suggestions_list = []

In [None]:
# Loop through list of keywords
for keyword in keywords:
    # Get suggestions for the keyword
    suggestions = pd.DataFrame(pytrends.suggestions(keyword))

    # Drop the 'mid' column
    if 'mid' in suggestions.columns:
      suggestions.drop(columns=['mid'], inplace=True)

    # Add a column for the original keyword
    suggestions['keyword'] = keyword

    # Append the DataFrame to the list
    suggestions_list.append(suggestions)
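
For a side-by-side comparison, the per-keyword DataFrames could be combined into a single table. A minimal sketch, using hand-written stand-in data instead of live `pytrends` responses; the `title` and `type` column names are assumed from what the suggestions endpoint typically returns, and all rows are placeholders.

```python
import pandas as pd

# Stand-in for suggestions_list; real entries come from pytrends.suggestions()
example_suggestions = [
    pd.DataFrame(
        {"title": ["Title 1", "Title 1", "Title 2"], "type": ["Topic"] * 3, "keyword": "keyword a"}
    ),
    pd.DataFrame({"title": ["Title 3"], "type": ["Book"], "keyword": "keyword b"}),
]

# Stack all per-keyword tables into one DataFrame
combined = pd.concat(example_suggestions, ignore_index=True)

# Count total and distinct suggestions per original keyword
summary = combined.groupby("keyword").agg(
    n_suggestions=("title", "size"),
    n_unique=("title", "nunique"),
)
print(summary)
```

A gap between `n_suggestions` and `n_unique` signals repeated suggestions, which hints at limited data for that keyword.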

**Analyze Alternative Keywords for 'Versatile for various agricultural needs'**

In [None]:
# Return the first search term in the list
suggestions_list[0]['keyword'][0]

In [None]:
# Return the suggested searches and evaluate briefly
suggestions_list[0]

**Interpretation of Results: 'Versatile for various agricultural needs'**

**Observation:**

A topic associated with the feature 'versatile for various agricultural needs' is drone technology. Additionally, Vetsark, a technology company focused on improving animal health and agriculture through digital solutions, also appears. The same suggestion 'Exploring the Skies...' appears four times in the results, which indicates extremely limited data for this keyword.

**Conclusion**

This interest is addressed by some farming technologies. However, the feature described by the keyword seems to be a **niche interest**. The same suggestion, a drone guide, appears several times. The related keywords suggested do not offer any clearer alternatives for this search term.

**Analyze Alternative Keywords for 'Organic gardening helper'**

In [None]:
# Return the next search term
suggestions_list[1]['keyword'][0]

In [None]:
# Return suggested terms for this keyword
suggestions_list[1]

**Interpretation of Results: 'Organic gardening helper'**

**Observation:**
Suggested terms for 'Organic gardening helper' include 'The Harvest Helper', a book about organic gardening, indoor gardening equipment, and various gardening supplies like paper pulp seed trays and jute string for training plants. None of these is related to robotic technology. One suggestion appears twice, indicating very limited data on this keyword.

**Conclusion:**

The suggested terms for 'Organic gardening helper' primarily relate to traditional gardening tools, books, and supplies. This suggests that the term is not strongly associated with technological solutions like the proposed robotic gardening assistant. While there is some interest in organic gardening help, the current phrasing doesn't effectively capture the potential market for a high-tech gardening tool. The related keyword suggestions don't provide useful alternatives for the keyword 'Organic gardening helper'. Keywords that emphasize the technological aspect (such as 'robotic gardening assistant') might be a better fit.

**Analyze Alternative Keywords for 'Hexapod mobility and agility'**

In [None]:
suggestions_list[2]['keyword'][0]

In [None]:
suggestions_list[2]

**Interpretation of Results: 'Hexapod mobility and agility'**

**Observation:**
The only suggested topic for 'Hexapod mobility and agility' is a scientific paper titled "Control of the Stiffness of Robotic Appendages." This indicates a narrow interest primarily within the scientific and robotics communities.

**Conclusion:**

The limited results suggest that 'Hexapod mobility and agility' is a niche interest with low general appeal. The suggestions do not offer alternative, more effective, phrases. Splitting the keyword into simpler terms like 'hexapod', 'mobility', and 'agility' could improve its effectiveness and broaden the potential audience.

#### Conclusion
* The sample of 3 keywords that received no search interest does not have strong alternative keywords that could increase their effectiveness.
* The keywords chosen are often niche interests, too specific, or too technical.
* A good next step is to evaluate the suggestions provided by Pytrends, analyze which niches the keywords appeal to, and then reduce their complexity by splitting and rephrasing them.

#### Hypothesis: The more words in a keyword, the lower the search interest.
* All keywords that received no interest consist of 3 or more words.
* Analyze whether the word count of a keyword affects interest levels.
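
One way to put a number on this hypothesis is a rank correlation between word count and average interest. The sketch below uses a made-up table shaped like `features_by_interest` (keywords as index); the keywords and values are illustrative only and are not taken from the notebook's data. Spearman's rank correlation is computed here as the Pearson correlation of the ranks.

```python
import pandas as pd

# Toy table shaped like features_by_interest (values are made up)
toy = pd.DataFrame(
    {"weekly_average_interest": [60.0, 45.0, 20.0, 5.0, 1.0]},
    index=[
        "Solar power",
        "Plant care",
        "Smart garden tools",
        "Remote garden access control",
        "Adaptive algorithms for soil conditions",
    ],
)

# Count whitespace-separated words in each keyword
toy["word_count"] = [len(keyword.split()) for keyword in toy.index]

# Spearman rank correlation: Pearson correlation of the ranked values
rho = toy["word_count"].rank().corr(toy["weekly_average_interest"].rank())

print(toy)
print(f"Spearman rho: {rho:.2f}")
```

A value of `rho` close to -1 would support the hypothesis, while a value near 0 would not. With only a handful of keywords per group, such a coefficient should be read as a rough indication rather than as evidence.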


In [None]:
# Return full list of keywords with no interest for the features
zero_relevance_features

**Observation:**

All keywords in the *features* group with zero search interest contain three or more words (for example, 'Organic gardening helper' has exactly three).

Check keywords with no interest for the other areas
* Inspect keywords in the problems group that received no interest.
* Inspect keywords in the needs group that received no interest.

In [None]:
# Return all problems with zero relevance
zero_relevance_problems

**Observation:**

All keywords in the *problems* group with zero search interest contain more than three words.

In [None]:
# Return all needs with zero relevance
zero_relevance_needs

**Observation:**

All keywords in the *needs* group with zero search interest contain more than four words.

**Compare successful keywords in the *features* group**

In [None]:
# Compare successful keywords
features_by_interest.head()

**Observation:**

The top four keywords with the highest interest are all two-word phrases. However, the fifth keyword, 'Remote access and control', consisting of four words, still demonstrates significant interest.

**Conclusion:**

While the number of words in a keyword can influence its search interest, it's not the sole determining factor. Other factors, such as the keyword's relevance to current trends and its specificity, also play a role in driving interest. This suggests that a combination of conciseness and relevance is key to maximizing search interest.

**Compare successful keywords in the *problems* group**

In [None]:
# Check the other areas of interest
problems_by_relevance.head()

**Observation:**
The keyword with the highest interest, 'Monoculture in agriculture', consists of three words and has a significantly higher weekly average interest (15) than the other keywords, which all have six or more words and a weekly average interest between 0.5 and 1.

**Conclusion:**

The data suggests a potential negative correlation between the number of words in a keyword and the level of search interest. Keywords with fewer words receive significantly higher interest than longer, more complex phrases.

**Compare successful keywords in the *needs* group**

In [None]:
needs_by_relevance.head(7)

**Observation:**

The keyword with the highest interest, "Climate resilience," consists of two words. The next keywords ("Educational AI," with two words, and "Sustainable food production," with three) have moderately lower interest, while those with four or more words show significantly less weekly average interest.

**Conclusion:**

The data suggests a potential negative correlation between the number of words in a keyword and the level of search interest. Keywords with fewer words tend to garner higher interest compared to longer, more complex phrases. While exceptions exist due to other factors influencing interest, the overall pattern supports the idea that shorter keywords are generally more likely to attract attention.

#### Conclusion:
The data analyzed across all keyword groups demonstrates that shorter keywords tend to generate higher search interest compared to longer, more complex phrases. While other factors like relevance and specificity also influence interest, reducing the word count of low-performing keywords is a good first step to improve their effectiveness.

<a name='key-features-analysis'></a>
## Key Features Analysis
- Show and evaluate the proposed features that are most relevant to potential customers.
- Analyze trends, fluctuations, and consistency in interest for most relevant features.
- Visualize and evaluate interest over time for the key features.

<a name='most-relevant-features'></a>
### Determine Features with Highest Relevance
- Present features sorted by their relevance.
- Determine features with significant relevance.
- Establish thresholds to classify features into the following categories:
  - Critical: Essential and highly relevant features.
    - Average weekly interest: 75% - 100%
  - High: Very important and significantly relevant features.
    - Average weekly interest: 50% - 75%
  - Moderate: Important but not essential features.
    - Average weekly interest: 10% - 50%
  - Low: Somewhat relevant but not crucial features.
    - Average weekly interest: 0% - 10%
- These values are relative to the highest weekly average interest value in the dataset.

In [None]:
# Show proposed feature keywords by relevance table
features_by_interest.head(3)

In [None]:
# Define a function to compute each keyword's relative interest (compared to the highest interest) and classify them into relevance levels.

def get_relative_interest(df):
    """
    Calculate the relative interest of each row in relation to the highest interest row,
    and categorize them into relevance levels.

    Parameters:
    df (pd.DataFrame): A DataFrame containing a 'weekly_average_interest' column.

    Returns:
    pd.DataFrame: The input DataFrame with two additional columns:
                  'relative_interest' - the interest as a percentage of the maximum interest,
                  'relevance' - a categorical column indicating the level of interest:
                                'Low', 'Moderate', 'High', or 'Critical'.
    """
    # Get the highest average weekly interest
    max_interest = df['weekly_average_interest'].max()

    # Add a column with the relative interest as a percentage for each datapoint
    df['relative_interest'] = (df['weekly_average_interest'] / max_interest * 100).round().astype(int)

    # Define edges for the four bins
    bin_edges = [-0.001, 10, 50, 75, 100]

    # Define labels for the four bins
    labels = ['Low', 'Moderate', 'High', 'Critical']

    # Create a new categorical column with relevance categories
    df['relevance'] = pd.cut(df['relative_interest'], bins=bin_edges, labels=labels, include_lowest=True)

    # Drop the intermediate 'relative_interest' column
    df.drop(columns=['relative_interest'], inplace=True)

    return df
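
Because `pd.cut` uses right-closed intervals by default, a value that falls exactly on a bin edge is assigned to the lower category. The following sketch reuses the bin edges and labels from the function above on hand-picked values to illustrate this behavior.

```python
import pandas as pd

bin_edges = [-0.001, 10, 50, 75, 100]
labels = ['Low', 'Moderate', 'High', 'Critical']

# Relative interest values chosen to sit on and around the bin edges
values = pd.Series([0, 10, 11, 50, 51, 75, 76, 100])
categories = pd.cut(values, bins=bin_edges, labels=labels, include_lowest=True)

print(pd.DataFrame({'relative_interest': values, 'relevance': categories}))
```

A relative interest of exactly 10 is therefore classified as 'Low', 50 as 'Moderate', and 75 as 'High'.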

In [None]:
# Classify feature keywords into interest categories
features_by_interest = get_relative_interest(features_by_interest)
features_by_interest.head(3)

In [None]:
# Show the share of features in the 'Low' bin as a percentage
total_features = len(features_by_interest)
low_interest_features = len(features_by_interest[features_by_interest['relevance'] == 'Low'])

low_relevance = (low_interest_features / total_features) * 100
print(f"Percentage of features in the 'Low' category: {round(low_relevance)}%")

#### Results
- 80% of Proposed Features Have Low Relevance:
  - These features have an average weekly interest below 10%.
  - These features should not be considered for the prototype.
- Moderate Relevance:
  - One proposed feature, 'Food System Health', has moderate relevance.
  - Patterns in interest should be analyzed further to determine whether interest is rising.
- High Relevance:
  - Three features have high relevance:
      - 'Educational Value'
      - 'Plant Identification'
      - 'Remote Access and Control'
  - Patterns and context of interest in these features need to be analyzed. These features are likely to attract customers and are valuable for future analysis.
- Critical Relevance:
  - Two proposed features have critical relevance:
    - 'Solar Powered'
    - 'Plant Detection'
  - Patterns and context of interest need to be analyzed in detail. These features have a strong customer base and are essential for customer segmentation analysis, market analysis, and competitor analysis. They are also candidates for inclusion in the first prototype (MVP).

<a name='key-features-trend-analysis'></a>
## Detailed Trend Analysis of Key Features
- Analyze trends in interest for features with moderate to critical relevance
  - Calculate and interpret summary statistics
    - Count
    - Mean
    - Standard Deviations
    - Minimum Interest
    - Maximum Interest
    - Quartiles
    - Variance (inferred)
  - Identify Trends
    - Determine if interest over the last 3 years is increasing, decreasing or stable.
    - Determine peaks and troughs in interest.
    - Determine if interest is currently rising or falling.
- Interpret analysis results
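
The summary statistics listed above map directly onto `DataFrame.describe()`, which reports count, mean, standard deviation, minimum, quartiles, and maximum. Variance is not part of that output and can be inferred by squaring the standard deviation. A minimal sketch on made-up values for a single hypothetical keyword:

```python
import pandas as pd

# Made-up weekly interest values for a single hypothetical keyword
toy = pd.DataFrame({"Example keyword": [12, 15, 18, 11, 15, 19]})

stats = toy.describe()

# describe() does not report variance; infer it from the standard deviation
stats.loc["variance"] = stats.loc["std"] ** 2

print(stats.round(2))
```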

### Trend Analysis
- Analyze trends in interest for features with moderate, high and critical interest.

<a name='trends-moderate-features'></a>
#### Trend Analysis for Features with Moderate Interest
- Assess whether interest has been increasing, decreasing, or stable over the past 3 years.
- Identify the peaks and troughs in interest levels.
- Determine the current trend in interest, whether it is rising or falling.

**Summary Statistics**
* Evaluate summary statistics to get an overview of the data's characteristics.


In [None]:
# Show features with moderate interest
features_by_interest[features_by_interest['relevance'] == 'Moderate']

In [None]:
# Define a function that returns columns in the features_interest_over_time dataframe by interest level in the features_by_interest dataframe

def get_relevant_keywords_df(features_by_interest, features_interest_over_time, relevance_level):
    """
    Returns a DataFrame with the relevant columns for a specified relevance level.

    Parameters:
    features_by_interest (pd.DataFrame): DataFrame containing feature relevance information with 'relevance' column.
    features_interest_over_time (pd.DataFrame): DataFrame containing weekly interest data for features.
    relevance_level (str): The relevance level to filter by.

    Returns:
    pd.DataFrame: DataFrame with the relevant columns in features_interest_over_time.
    """
    # Create a list of features with the specified relevance level
    relevant_features = features_by_interest[features_by_interest['relevance'] == relevance_level].index

    # Find matching columns in the interest over time dataframe
    matching_columns = [col for col in features_interest_over_time.columns if col in relevant_features]

    # Return the matching columns in the interest over time dataframe
    relevant_features_df = features_interest_over_time[matching_columns]

    return relevant_features_df



In [None]:
# Get a dataframe for all features with 'Moderate' interest
moderate_interest_df = get_relevant_keywords_df(features_by_interest, features_interest_over_time, 'Moderate')
moderate_interest_df.head(3)

In [None]:
# Show summary stats for moderate interest features
summary_stats_moderate_interest = moderate_interest_df.describe().round().astype(int)
summary_stats_moderate_interest

**Observations:**
* Interest tends to cluster around a moderate value of 15.
* The standard deviation is low (4 units), indicating consistent interest over time.
* The lowest recorded interest is 3, with no weeks showing zero interest.
* The quartiles show little variation, with 25th percentile at 11, median at 15, and 75th percentile at 18.
* The maximum interest level of 26 indicates at least one instance of 'High' interest.

**Conclusion:**

The data suggests a consistent and reliable moderate interest in the 'Food system health' feature over the past three years. The low standard deviation and lack of zero-interest weeks confirm continuous interest. While the interest level has reached a 'High' point at least once, the overall pattern suggests low fluctuation in interest over time.

In [None]:
moderate_interest_df.head(3)

**Linear Regression Trend Analysis**
* Assess change in interest over time using a statistical model.
* Evaluate suitability of keywords for further analysis and prototype development based on general trend over the last 3 years.

In [None]:
# Import necessary libraries
from scipy.stats import linregress

In [None]:
# Define a function that assesses interest trends using linear regression analysis

def analyze_interest_trends(df, slope_decimals=2):
    """
    Analyzes interest trends over time for all columns in a DataFrame.

    Parameters:
    df (pd.DataFrame): DataFrame containing a date index and interest columns for each keyword.
    slope_decimals (int): Number of decimal places to round the slope to.

    Prints:
    - Slope of the linear regression
    - P-value of the linear regression
    - Yearly increase in interest for each column
    """
    # Create a copy of the original DataFrame
    df_ordinal = df.copy()

    # Reset the index to convert the date index to a column
    df_ordinal.reset_index(inplace=True)

    # Convert the date column to ordinal
    df_ordinal['date_ordinal'] = df_ordinal['date'].apply(lambda x: x.toordinal())

    # Perform linear regression for each column
    for column_name in df.columns:
        if column_name == 'date':
            continue

        result = linregress(df_ordinal['date_ordinal'], df_ordinal[column_name])

        # Extract the slope and p-value
        slope = result.slope
        p_value = result.pvalue

        # Print the results
        print(f"Analysis for '{column_name}':")
        print(f"Slope: {slope.round(slope_decimals)}")
        print(f"P-value: {p_value}")
        print(f'Yearly change in interest: {round(slope * 365, 2)} units')
        print("-" * 50)

In [None]:
analyze_interest_trends(moderate_interest_df)

**Observation:**
* Interest in the proposed feature 'Food system health' is increasing at a yearly rate of ~1.74 units
* The p-value of ~3.35e-6 is far below the conventional 0.05 threshold, indicating that this positive trend is statistically significant.

**Conclusion:**
* This increase in interest makes the proposed feature relevant for future analysis.
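
The yearly figure reported above follows from the date encoding: `toordinal()` counts days, so the regression slope is expressed in interest units per day, and multiplying it by 365 gives units per year. A small sketch on synthetic data (a noise-free series constructed to rise by exactly 0.01 units per day, not the notebook's data):

```python
import pandas as pd
from scipy.stats import linregress

# Synthetic weekly series rising by 0.01 units per day (0.07 per week)
dates = pd.date_range("2021-06-27", periods=157, freq="7D")
interest = pd.Series([10 + 0.07 * i for i in range(len(dates))], index=dates)

# Ordinal dates count days, so the fitted slope is in units per day
x = [d.toordinal() for d in dates]
result = linregress(x, interest.values)

yearly_change = result.slope * 365
print(f"Slope per day: {result.slope:.4f}")
print(f"Yearly change: {yearly_change:.2f} units")
```

With real, noisy data the slope only summarizes the average direction over the whole period; it does not show whether interest rose steadily or in bursts.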


**Trend Analysis Using Peak Detection**

In [None]:
# Import necessary libraries
from scipy.signal import find_peaks

In [None]:
# Define a function to calculate peaks and troughs in the dataset
def get_peaks_and_troughs(df):
    """
    Analyzes peaks and troughs for each column in the given DataFrame.

    Parameters:
    df (pd.DataFrame): DataFrame containing interest data with dates as the index.

    Returns:
    dict: Dictionary containing DataFrames with peaks and troughs as boolean columns for each interest column.
    """
    result_dict = {}

    # Iterate over each column in the DataFrame
    for column in df.columns:
        # Identify peaks
        peaks, _ = find_peaks(df[column])  # Returns an array of local maxima

        # Identify troughs by inverting the data
        troughs, _ = find_peaks(-df[column])  # Returns an array of local minima

        # Create a DataFrame to store the results for peaks and troughs
        analysis_df = pd.DataFrame({
            'Date': df.index,
            'Interest': df[column],
            'Peak': [True if i in peaks else False for i in range(len(df))],
            'Trough': [True if i in troughs else False for i in range(len(df))]
        })

        # Set the index to Date
        analysis_df.set_index('Date', inplace=True)

        # Store the DataFrame in the dictionary
        result_dict[column] = analysis_df

    return result_dict

In [None]:
# Find peaks and troughs for moderate interest features
peaks_throughs_dict = get_peaks_and_troughs(moderate_interest_df)
peaks_throughs_dict.keys()

In [None]:
# Initialize a dictionary to store results for each feature
peaks_dict = {}
throughs_dict = {}

# Iterate over each column in the DataFrame
for column in moderate_interest_df.columns:
    # Identify peaks
    peaks, _ = find_peaks(moderate_interest_df[column]) # Returns an array of local maxima

    # Identify troughs by inverting the data
    throughs, _ = find_peaks(-moderate_interest_df[column]) # Returns an array of local minima

    # Create a DataFrame to store the results for peaks
    moderate_interest_peaks = pd.DataFrame({
        'Date': moderate_interest_df.index,
        'Interest': moderate_interest_df[column],
        'Peak': [True if i in peaks else False for i in range(len(moderate_interest_df))]
    })

    # Create a DataFrame to store the results for troughs
    moderate_interest_throughs = pd.DataFrame({
        'Date': moderate_interest_df.index,
        'Interest': moderate_interest_df[column],
        'Through': [True if i in throughs else False for i in range(len(moderate_interest_df))]
    })

    # Store the DataFrames in the dictionaries
    peaks_dict[column] = moderate_interest_peaks
    throughs_dict[column] = moderate_interest_throughs


In [None]:
# Return all peaks with dates
peaks_only = peaks_dict['Food system health'][peaks_dict['Food system health']['Peak'] == True]
peaks_only.tail(3)

In [None]:
# Return all troughs with dates
throughs_only = throughs_dict['Food system health'][throughs_dict['Food system health']['Through'] == True]
throughs_only.tail(3)

**Exploration of Frequency of Peaks**
* Explore the frequency of peaks to identify patterns in interest.

In [None]:
# Calculate the time between consecutive peaks (diff() stores the gap to the previous peak in each row)
peaks_only_copy = peaks_only.copy()
peaks_only_copy['Difference'] = peaks_only_copy['Date'].diff()
peaks_only_copy.head(3)

In [None]:
# Show the frequency of different peak-to-peak time intervals
peaks_only_copy['Difference'].value_counts()

**Observations:**
* Interest in 'Food system health' follows regular patterns, most commonly peaking every two weeks (14 days), three weeks (21 days), or four weeks (28 days).
* Less frequent peaks occur every 42, 49, or 56 days, suggesting periodic but less common events or activities driving interest.
* Intervals of 35 days and 84 days each occur only once, indicating some randomness, possibly due to unexpected spikes or outlier events.
* The data shows at least one peak of interest every 12 weeks (84 days), or roughly once per quarter.

**Conclusion:**

The proposed feature 'Food system health' receives interest in mostly regular patterns, with at least one peak of interest expected every quarter.
The causes for these regular patterns could be due to various factors, including events, activities, external conditions (weather, health crises, climate change), media coverage, and other potential influences.
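
The interval counting above can be sketched end to end on synthetic data. The series below is hypothetical and only demonstrates the mechanics. Note that with weekly data every peak-to-peak gap is necessarily a multiple of 7 days, so the interval values themselves partly reflect the sampling frequency.

```python
import pandas as pd
from scipy.signal import find_peaks

# Hypothetical weekly series with local maxima at positions 1, 3, 5, 8 and 11
dates = pd.date_range("2024-01-07", periods=13, freq="W-SUN")
values = [1, 5, 2, 6, 3, 7, 2, 1, 8, 4, 3, 9, 1]
interest = pd.Series(values, index=dates)

# Locate the peaks and look up their dates
peak_positions, _ = find_peaks(interest.to_numpy())
peak_dates = interest.index[peak_positions]

# Gap between each peak and the one before it, counted by length in days
intervals = peak_dates.to_series().diff().dropna()
interval_counts = intervals.dt.days.value_counts().sort_index()
print(interval_counts)
```

Here the 14-day and 21-day gaps each occur twice.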

**Exploration of Frequency of Troughs**
* Explore the frequency of troughs to identify patterns in interest.

In [None]:
# Calculate difference between consecutive troughs
throughs_only_copy = throughs_only.copy()
throughs_only_copy['Difference'] = throughs_only_copy['Date'].diff()
throughs_only_copy.head(3)

In [None]:
# Show frequency of different trough-to-trough time intervals
throughs_only_copy['Difference'].value_counts()

**Observations:**
* The time intervals between troughs in interest for 'Food system health' most commonly occur at 3 weeks (21 days), 2 weeks (14 days), and 5 weeks (35 days).
* Troughs less commonly occur every 28, 42, or 56 days.
* Some randomness is present, as indicated by unique intervals of 49 days and 70 days, occurring only once.
* The data shows that there is always at least one trough every 10 weeks (70 days), which is slightly less than the longest interval for peaks (84 days).

**Conclusion**:

The consistent intervals between troughs, primarily 2, 3, and 5 weeks, suggest that recurring events or activities influence interest in 'Food system health'. Short peak-to-peak intervals are more frequent than short trough-to-trough intervals, which may indicate that interest tends to rise more quickly than it declines. This would be consistent with a gradual increase in overall interest over time.



**Analysis of Magnitude of Peaks and Troughs**
* Evaluate volatility of interest.
* Evaluate overall trend and recent trends in interest.
* Analyze how interest changed over the last 3 years.

In [None]:
# Define a function to analyze peaks and troughs - return average yearly peaks and troughs
def analyze_peaks_and_troughs(df):
    """
    Analyzes peaks and troughs in a DataFrame.

    This function performs the following steps:
    1. Filters the DataFrame to include only rows where either 'Peak' or 'Trough' is True.
    2. Sorts the peaks and troughs by interest in descending and ascending order, respectively.
    3. Calculates the average interest for peaks and troughs per year.
    4. Calculates the dates of the highest peak and the lowest trough.
    5. Prints the results.

    Parameters:
    df (pd.DataFrame): A DataFrame with columns 'Interest', 'Peak', and 'Trough',
                       and a DateTimeIndex.

    Returns:
    tuple: Two pandas Series with the average interest for peaks and troughs per year,
    followed by two DataFrames with the peaks and troughs sorted by interest.
    """
    # Add a new column for the year
    df['Year'] = df.index.year

    # Filter for peaks and troughs
    peaks_throughs_only = df[(df['Peak'] == True) | (df['Trough'] == True)]

    # Sort peaks and troughs by interest
    peaks_by_interest = peaks_throughs_only[peaks_throughs_only['Peak'] == True].sort_values(by='Interest', ascending=False)
    troughs_by_interest = peaks_throughs_only[peaks_throughs_only['Trough'] == True].sort_values(by='Interest', ascending=True)

    # Calculate average interest for peaks and troughs per year
    average_peaks_per_year = peaks_throughs_only[peaks_throughs_only['Peak'] == True].groupby('Year')['Interest'].mean().round().astype(int)
    average_troughs_per_year = peaks_throughs_only[peaks_throughs_only['Trough'] == True].groupby('Year')['Interest'].mean().round().astype(int)

    # Show the date of the highest peak and the lowest trough
    highest_peak_date = peaks_by_interest.index[0].strftime("%m/%d/%Y") if not peaks_by_interest.empty else "No peaks"
    lowest_trough_date = troughs_by_interest.index[0].strftime("%m/%d/%Y") if not troughs_by_interest.empty else "No troughs"

    # Print the results
    print(f'The date with the highest interest is {highest_peak_date}')
    print("-" * 50)

    print(f'The date with the lowest interest is {lowest_trough_date}')
    print("-" * 50)

    print("Average Peaks Per Year:")
    print(average_peaks_per_year.to_string())
    print("-" * 50)

    print("Average Troughs Per Year:")
    print(average_troughs_per_year.to_string())
    print("-" * 50)
    return average_peaks_per_year, average_troughs_per_year, peaks_by_interest, troughs_by_interest

In [None]:
# Show peaks and troughs for 'Food system health'
peaks_throughs_dict['Food system health'].head(3)

**Analysis of Interest Trend Over Time**
* Compare yearly average peaks and troughs to assess fluctuations in interest over the last 3 years.
* Determine the general trend of interest over the years.
* Determine the stability or volatility of interest.

In [None]:
# Return analysis results for 'Food system health'
average_peaks, average_troughs, peaks_by_interest, troughs_by_interest = analyze_peaks_and_troughs(peaks_throughs_dict['Food system health'])

**Observations:**
* The date with the highest peak in interest is 04/14/2024.
* The date with the lowest trough in interest is 12/24/2023.
* The average trough values rose from 10 in 2021 to 13 in 2022 and 2023, and to 16 in 2024.
* The average peak and trough values are close to each other, indicating overall stable interest in 'Food system health'.

**Conclusion:**

The data reveals a steady upward trend in interest for 'Food system health' from 2021 to 2024, with both peak and trough interest levels increasing. This consistent growth, coupled with the relatively small difference between peak and trough values, suggests a sustained and stable interest in this proposed feature.
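
The yearly comparison can be sketched on a small made-up table in the same shape as the peaks and troughs DataFrames used above (columns 'Interest', 'Peak', 'Trough' with a date index). This variant groups by the year taken from the index instead of adding a 'Year' column, so the input is left unmodified. The `spread` series is an assumption added here to quantify how close peaks and troughs are in each year.

```python
import pandas as pd

# Hypothetical extrema table; the values are illustrative only
extrema = pd.DataFrame(
    {
        "Interest": [12, 8, 14, 9, 18, 11, 20, 13],
        "Peak": [True, False, True, False, True, False, True, False],
        "Trough": [False, True, False, True, False, True, False, True],
    },
    index=pd.to_datetime(
        ["2022-03-06", "2022-04-03", "2022-09-04", "2022-10-02",
         "2023-03-05", "2023-04-02", "2023-09-03", "2023-10-01"]
    ),
)

peaks = extrema[extrema["Peak"]]
troughs = extrema[extrema["Trough"]]

# Average interest at peaks and at troughs for each calendar year
avg_peaks = peaks.groupby(peaks.index.year)["Interest"].mean()
avg_troughs = troughs.groupby(troughs.index.year)["Interest"].mean()

# Gap between average peak and average trough as a simple stability measure
spread = avg_peaks - avg_troughs
print(spread)
```

A small and steady spread points to stable interest, while a widening spread points to growing volatility.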

**Exploration of Yearly Peak Interests**
* Identify extreme values in interest to understand times of highest interest.

In [None]:
# Find the maximum peak in every year
peaks_by_year = peaks_only.groupby(peaks_only['Date'].dt.year)['Interest'].max()
peaks_by_year

**Observations:**
* The data indicates a consistent year-over-year increase in maximum interest, with the highest peak of interest rising each year.
* Specifically, the highest peaks were 17 in 2021, 20 in 2022, 22 in 2023, and 26 in 2024.

**Conclusion:**

The data indicates a consistent, gradual increase in maximum interest in 'Food system health' each year, with the highest peak rising from 17 in 2021 to 26 in 2024. This suggests a steady and sustained growth in overall interest in the proposed feature.

**Exploration of Lowest Yearly Interests**
* Identify extreme values in interest to understand times of lowest interest.

In [None]:
# Find the lowest trough every year
throughs_by_year = throughs_only.groupby(throughs_only['Date'].dt.year)['Interest'].min()
throughs_by_year

**Observations:**
* The yearly lowest interest values for 'Food system health' fluctuate.
* While a slight increase occurred from 2021 (6) to 2022 (7), a notable decrease followed in 2023 (3).
* In 2024, the lowest interest point (10) was higher than any previous year.

**Conclusion:**

Although the lowest interest values fluctuate from year to year, the 2024 data could suggest a potential rising baseline of interest in 'Food system health'.



**Current Interest Trend Analysis**
* Determine the current direction of interest (increasing or decreasing).
* Identify the date when the current trend began.

In [None]:
# Define a function to determine direction and start date of current trend
def determine_current_trend(df):
    """
    Determines if the interest is currently increasing or decreasing based on the last extrema.

    Parameters:
    df (pd.DataFrame): A DataFrame with columns 'Interest', 'Peak', 'Trough', and 'Year',
                       and a DateTimeIndex.

    Returns:
    None
    """
    # Get the current interest value and date
    current_value = df.iloc[-1]['Interest']
    current_date = df.index[-1]

    # Get the last trough value and date
    last_trough = df[df['Trough']].iloc[-1]
    last_trough_value = last_trough['Interest']
    last_trough_date = last_trough.name

    # Determine the current trend
    if current_value > last_trough_value:
        current_trend = "increasing"
    else:
        current_trend = "decreasing"

    # Calculate the duration of the current trend
    trend_duration_weeks = round((current_date - last_trough_date).days / 7)

    # Print the results
    print(f"The last trough value was {last_trough_value} on {last_trough_date.strftime('%m/%d/%Y')}.")
    print("-" * 50)
    print(f"The current interest value is {current_value} on {current_date.strftime('%m/%d/%Y')}.")
    print("-" * 50)
    print(f"\nThe interest is currently {current_trend} and has been observed for {trend_duration_weeks} week(s).")

In [None]:
# Determine current trend for 'Food system health'
determine_current_trend(peaks_throughs_dict['Food system health'])

**Observations:**
* Interest in 'Food system health' is currently at a low level of 11 as of June 23, 2024.
* There has been a slight increase of 1 unit over the past week.
* The previous trough value was 10 on June 16, 2024.

**Conclusion:**

While the current interest level in 'Food system health' is low, the recent one-week increase could indicate the beginning of an upward trend. However, this data is insufficient to definitively confirm a sustained increase.
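
As a possible variation on the function above, the sketch below returns its result instead of printing it and bases the direction on the most recent extremum of either kind: if the last marked point is a trough, interest has not fallen since then, and if it is a peak, interest has not risen since then. The helper name `current_trend` and the sample data are hypothetical.

```python
import pandas as pd

def current_trend(extrema):
    """Return (direction, start_date, weeks) based on the most recent extremum."""
    marked = extrema[extrema["Peak"] | extrema["Trough"]]
    if marked.empty:
        return None
    last = marked.iloc[-1]
    # A trailing trough means values have been rising since; a trailing peak, falling
    direction = "increasing" if last["Trough"] else "decreasing"
    start_date = marked.index[-1]
    weeks = round((extrema.index[-1] - start_date).days / 7)
    return direction, start_date, weeks

# Hypothetical weekly data ending on 2024-06-23
dates = pd.date_range("2024-05-19", periods=6, freq="W-SUN")
extrema = pd.DataFrame(
    {
        "Interest": [10, 14, 9, 12, 13, 15],
        "Peak": [False, True, False, False, False, False],
        "Trough": [False, False, True, False, False, False],
    },
    index=dates,
)

direction, start_date, weeks = current_trend(extrema)
print(direction, start_date.date(), weeks)
```

Returning the values makes it easier to collect the current trend for several keywords into one table.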


<a name='trends-high-features'></a>
#### Trend Analysis for Features with High Interest
- Assess whether interest is increasing, decreasing, or stable over the past 3 years.
- Identify the magnitude of peaks and troughs in interest levels.
- Determine whether interest is currently rising or falling.

**Summary Statistics**
* Evaluate summary statistics to get an overview of the data's characteristics.
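
For orientation, this is a minimal sketch of what `describe()` returns and how the mean can be compared with the median (the '50%' row). The keyword names and values are made up. A mean below the median hints that a few low values pull the average down.

```python
import pandas as pd

# Hypothetical weekly interest for two made-up keywords
df = pd.DataFrame({
    "keyword_a": [40, 50, 60, 50, 100],
    "keyword_b": [0, 60, 60, 60, 70],
})

# Rows: count, mean, std, min, 25%, 50%, 75%, max
stats = df.describe()

# Compare the mean with the median for each keyword
mean_below_median = stats.loc["mean"] < stats.loc["50%"]
print(stats.loc[["mean", "50%", "min", "max"]])
print(mean_below_median)
```

In this example only `keyword_b`, which contains a single week with zero interest, has a mean below its median.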

In [None]:
# Get a dataframe for all features with 'High' interest
high_interest_df = get_relevant_keywords_df(features_by_interest, features_interest_over_time, 'High')
high_interest_df.head(3)

In [None]:
# Show summary stats for high interest features
summary_stats_high_interest = high_interest_df.describe().round().astype(int)
summary_stats_high_interest

**Observations:**

- **'Educational value':**
  - Interest tends to cluster around a high value of 56.
  - Interest levels typically deviate from the mean by 16 units, indicating notable variability in interest.
  - The lowest weekly interest over the last three years is 22, which is significantly below the mean but still indicates moderate interest.
  - The quartile values indicate moderate variability, with interest generally tending to be on the higher side.
  - The feature received interest of 100, the highest possible value, at least once over the last 3 years.
- **'Remote access and control':**
  - Interest tends to cluster around a high value of 48.
  - Interest levels typically deviate from the mean by 24 units, indicating high variability in interest.
  - The lowest weekly interest over the last three years is 0, indicating that this feature received no interest at least once over the period.
  - The quartile values show moderate variability. The very low values (minimum of 0) pull the mean down, making it smaller than the median.
  - The feature received interest of 100, the highest possible value, at least once over the last 3 years.
- **'Plant identification':**
  - Interest tends to cluster around a high value of 54.
  - Interest levels typically deviate from the mean by 18 units, indicating significant variability in interest.
  - The lowest weekly interest over the last three years is 26, which is the highest minimum value among the high-interest features, indicating a relatively high continuous interest.
  - The quartile values show moderate variability, with interest values generally being higher than the mean.
  - The feature received interest of 100, the highest possible value, at least once over the last 3 years.

**Conclusion:**

While all three features show peaks at 100, 'Remote access and control' has the highest variability and periods of no interest, whereas 'Plant identification' has the highest minimum interest, demonstrating a consistent baseline of interest. 'Educational value' remains the most stable, with the lowest variability and consistently high interest.

**Linear Regression Trend Analysis**
* Assess change in interest over time using a statistical model.
* Evaluate suitability of keywords for further analysis and prototype development based on general trend over the last 3 years.
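
When several keywords are compared, it can help to collect the regression results in one table rather than reading printed output. The sketch below is a hypothetical helper, not the function used in this notebook: it assumes a DataFrame with a date index (rather than a 'date' column) and flags a trend as significant when the p-value is below an assumed threshold of 0.05. Weekly observations are often autocorrelated, so these p-values are best read as indicative.

```python
import pandas as pd
from scipy.stats import linregress

def summarize_trends(df, alpha=0.05):
    """Return yearly change, p-value and a significance flag per column."""
    day_numbers = [d.toordinal() for d in df.index]
    rows = {}
    for column in df.columns:
        result = linregress(day_numbers, df[column])
        rows[column] = {
            "yearly_change": result.slope * 365,
            "p_value": result.pvalue,
            "significant": result.pvalue < alpha,
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Hypothetical data: one clearly rising series and one that only oscillates
dates = pd.date_range("2021-07-04", periods=20, freq="W-SUN")
demo = pd.DataFrame(
    {
        "rising": [i + (i % 3) for i in range(20)],
        "flat": [50 + (-1) ** i for i in range(20)],
    },
    index=dates,
)

summary = summarize_trends(demo)
print(summary)
```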

In [None]:
high_interest_df.head(3)

In [None]:
# Get the interest trends for all high interest keywords
analyze_interest_trends(high_interest_df)

**Observations:**
* **'Educational value':**
  * Interest in the proposed feature 'Educational value' is increasing at a yearly rate of ~3.83 units.
* **'Remote access and control':**
  * Interest in the proposed feature 'Remote access and control' is increasing substantially at a yearly rate of ~19.36 units.
* **'Plant identification':**
  * Interest in the proposed feature 'Plant identification' is decreasing at a yearly rate of ~4.6 units.
* **Statistical Significance of Results:**
  * The p-values of all features indicate that the trends shown are statistically significant, with all of them having a value < 0.05.

**Conclusion:**
* The substantial increase in interest in 'Remote access and control' highlights this feature as especially valuable and relevant for further analysis and inclusion in the prototype.
* Although its interest is increasing at a lower rate than that of 'Remote access and control', 'Educational value' remains a relevant feature for further analysis and consideration in the prototype.
* Interest in 'Plant identification' is decreasing, but the rate of decline is low, suggesting that it remains a candidate for further analysis.

**Trend Analysis Using Peak Detection**
* Determine peaks and troughs in interest.
* Determine the average yearly magnitude of peaks and troughs and interpret results.

**Peak Detection Analysis for 'Educational value'**

In [None]:
# Find peaks and troughs for high interest features
peaks_throughs_dict = get_peaks_and_troughs(high_interest_df)
peaks_throughs_dict.keys()

In [None]:
# Analyze peaks and troughs for 'Educational value'
peaks_throughs_dict['Educational value'].head(3)

In [None]:
# Return analysis results
average_peaks, average_troughs, peaks_by_interest, troughs_by_interest = analyze_peaks_and_troughs(peaks_throughs_dict['Educational value'])

In [None]:
# Calculate variation (max - min) in yearly average peaks and troughs
peak_variation = average_peaks.max() - average_peaks.min()
trough_variation = average_troughs.max() - average_troughs.min()
print(f'Variation in peaks: {peak_variation}')
print(f'Variation in troughs: {trough_variation}')

In [None]:
# Return the value of the highest interest and the lowest interest
print(f'Highest interest: {peaks_by_interest.iloc[0]["Interest"]}')
print(f'Lowest interest: {troughs_by_interest.iloc[0]["Interest"]}')

**Observations:**
* The proximity of the highest and lowest interest dates indicates significant short-term fluctuations in interest levels.
* High variation in yearly average peak and trough interest (25 and 19 units, respectively) demonstrates substantial volatility.
* Both average peak and trough interest levels show a general upward trend over the years, despite a dip in 2023.
* Peak interest reached the maximum possible level (100), indicating high potential value.
* The lowest interest level (22) represents a baseline of moderate interest.

**Conclusion:**

The feature 'Educational value' shows an overall positive trend, but pronounced short-term fluctuations and high volatility suggest an unstable interest level influenced by external factors. The dip in 2023 requires further investigation to understand potential negative influences. Despite the fluctuations, the feature consistently maintains a moderate baseline of interest and has reached the maximum interest value of 100, indicating potentially high value. It will be considered for further analysis.

**Peak Detection Analysis for 'Remote access and control'**

In [None]:
# Analyze peaks and troughs for 'Remote access and control'
peaks_throughs_dict['Remote access and control'].head(3)

In [None]:
# Return analysis results for 'Remote access and control'
average_peaks, average_troughs, peaks_by_interest, troughs_by_interest = (
    analyze_peaks_and_troughs(peaks_throughs_dict['Remote access and control'])
)

In [None]:
# Return the value of the highest interest and the lowest interest
print(f'Highest interest: {peaks_by_interest.iloc[0]["Interest"]}')
print(f'Lowest interest: {troughs_by_interest.iloc[0]["Interest"]}')

**Observations:**
* The date of the lowest interest (0), 08/01/2021, and the date of the highest interest (100) are far apart.
* The yearly averages show a steep upward trend over the last 3 years: the average peak rose from 43 in 2021 to 85 in 2024, and the average trough rose from 4 in 2021 to 54 in 2024.
* Average peak interest has roughly doubled from 2021 to 2024, indicating rapid growth.
* The lowest interest values captured since 2022 demonstrate a baseline of moderate to high interest.

**Conclusion:**
* The peak interest recently reaching the maximum possible level (100), the fast increase in interest over the last 3 years, and the stable moderate baseline interest make 'Remote access and control' a high-value feature for the prototype. It will be considered for further analysis.


**Peak Detection Analysis for 'Plant identification'**

In [None]:
# Analyze peaks and troughs for 'Plant identification'
peaks_throughs_dict['Plant identification'].head(3)

In [None]:
average_peaks, average_troughs, peaks_by_interest, troughs_by_interest = (
    analyze_peaks_and_troughs(peaks_throughs_dict['Plant identification'])
)

In [None]:
# Return the value of the highest interest and the lowest interest
print(f'Highest interest: {peaks_by_interest.iloc[0]["Interest"]}')
print(f'Lowest interest: {troughs_by_interest.iloc[0]["Interest"]}')

In [None]:
# Calculate variation (max - min) in yearly average peaks and troughs
peak_variation = average_peaks.max() - average_peaks.min()
trough_variation = average_troughs.max() - average_troughs.min()
print(f'Variation in peaks: {peak_variation}')
print(f'Variation in troughs: {trough_variation}')

**Observations:**

* The dates of the lowest interest (44 on 12/17/2023) and the highest interest (100) are relatively close, suggesting short-term variation in the data.
* The overall variation between peak and trough values is 12, relatively low compared to other keywords.
* Average peak values show a slight downward trend over the last 3 years, with peaks ranging from 51 in 2024 to 63 in 2022.
* Trough values indicate consistent interest, with the highest trough also in 2022.
* This feature has demonstrated consistently high interest over the last 3 years, with a significantly higher value for the lowest interest than other keywords.

**Conclusion:**

The relatively low variation between peak and trough interest, along with the consistently high interest over the last three years, suggests that this feature is a strong candidate for further analysis. The observed downward trend in average peak values should be further investigated to understand potential factors influencing this decline.


**Current Interest Trend Analysis**
* Determine the current direction of interest (increasing or decreasing).
* Identify the date when the current trend began.

**Current Interest Trend for 'Educational value'**

In [None]:
# Determine the current trend for 'Educational value'
determine_current_trend(peaks_throughs_dict['Educational value'])

**Observations:**
* Interest in 'Educational value' is currently at a moderate level of 48 as of June 23, 2024.
* There has been an increase of 11 units over the past week.

**Conclusion:**

The recent increase in interest requires further monitoring due to the historically fluctuating nature of interest in this keyword. Given the one-week timeframe of the current upward trend, it's too early to determine if this signals a significant increase in interest or a short-term fluctuation.

**Current Interest Trend for 'Remote access and control'**

In [None]:
# Determine the current trend for 'Remote access and control'
determine_current_trend(peaks_throughs_dict['Remote access and control'])

**Observations:**
* Interest in this feature has been increasing over the last week by 11 units and is currently at a high value of 58 as of 06/23/2024.
* Interest in this feature is currently slightly higher than in 'Educational value'.

**Conclusion:**

* The interest in 'Remote access and control' is currently high and on an upward trend, indicating potential for further growth.
* Although the interest has been increasing for one week, further analysis is required to understand the long-term significance and sustainability of this trend.

**Current Interest Trend for 'Plant identification'**

In [None]:
# Determine the current trend for 'Plant identification'
determine_current_trend(peaks_throughs_dict['Plant identification'])

**Observations:**
* Interest in this feature has been decreasing for the past four weeks, reaching a current value of 53.
* Overall interest remains similar to other 'high' relevance features. It currently exceeds interest in 'Educational value' but falls below that of 'Remote access and control'.
* This recent decline aligns with a broader downward trend observed over a longer period.

**Conclusion:**

The current data supports the previously identified downward trend in interest for this feature. However, the overall interest level remains comparable to other high-relevance features, suggesting continued potential value. Further investigation is needed to determine the cause of the decline and assess whether this trend is likely to continue.


<a name='trends-critical-features'></a>
#### Trend Analysis for Features with Critical Interest
* Evaluate key metrics of interest over time.
* Assess whether interest is increasing, decreasing, or stable over the past 3 years.
* Assess peaks and troughs in interest levels.
* Determine current trend in interest.

**Summary Statistics**
* Evaluate summary statistics to get an overview of the data's characteristics.

In [None]:
# Get a dataframe for all features with 'Critical' interest
critical_interest_df = get_relevant_keywords_df(features_by_interest, features_interest_over_time, 'Critical')
critical_interest_df.head(3)

In [None]:
# Show summary stats for critical features
summary_stats_critical_interest = critical_interest_df.describe().round().astype(int)
summary_stats_critical_interest

**Observations:**

- **'Plant detection':**
  - Interest tends to cluster around a high value of 64.
  - Interest levels typically deviate from the mean by 17 units, indicating some variability in interest.
  - The lowest weekly interest over the last three years is 23, which is significantly below the mean but still indicates moderate interest.
  - The quartile values indicate moderate variability, with interest generally tending to be on the higher side.
  - The feature received interest of 100, the highest possible value, at least once over the last 3 years.
- **'Solar powered':**
  - Interest tends to cluster around a critical value of 75.
  - Interest levels typically deviate from the mean by 12 units, indicating low variability in interest.
  - The lowest weekly interest over the last three years is 51, indicating that this feature consistently receives high interest.
  - The quartile values show low variability, indicating continuously high interest in this feature.
  - The feature received interest of 100, the highest possible value, at least once over the last 3 years.

**Conclusion:**

Both features show continuously high interest. 'Solar powered' never received an interest value below 51, showing consistently high interest with low variability. This indicates that 'Solar powered' is a highly stable and critical feature. 'Plant detection' also receives continuously high interest. Its interest levels vary more than those of 'Solar powered', but the feature remains highly relevant overall.

**Linear Regression Trend Analysis**
* Assess change in interest over time using a statistical model.
* Evaluate suitability of keywords for further analysis and prototype development based on general trend over the last 3 years.

In [None]:
critical_interest_df.head(3)

In [None]:
# Get interest trends for critical interest features
analyze_interest_trends(critical_interest_df)

**Observations:**
* Interest in the proposed feature 'Plant detection' is increasing significantly at a yearly rate of ~12.38 units.
* Interest in the proposed feature 'Solar powered' is decreasing at a yearly rate of ~2.54 units.
* The trends for all features are statistically significant (p-values < 0.05).

**Conclusion:**

* The substantial increase in interest in 'Plant detection' highlights this feature as especially valuable and relevant for further analysis and inclusion in the prototype.
* Despite a slight decline, 'Solar powered' remains valuable due to its relatively high overall interest and moderate rate of decline.

**Trend Analysis Using Peak Detection**
* Determine peaks and troughs in interest.
* Determine the average yearly magnitude of peaks and troughs and interpret results.
* Evaluate the yearly variability in interest.

In [None]:
# Get the peaks and troughs for all critical interest features
peaks_throughs_dict = get_peaks_and_troughs(critical_interest_df)
peaks_throughs_dict.keys()

**Peak Detection Analysis for 'Plant detection'**

In [None]:
# Show the dataframe with calculated peaks and troughs
peaks_throughs_dict['Plant detection'].head(3)

In [None]:
# Calculate average peaks and troughs and show dates with highest and lowest interest
average_peaks, average_troughs, peaks_by_interest, troughs_by_interest = (
    analyze_peaks_and_troughs(peaks_throughs_dict['Plant detection'])
)

In [None]:
# Calculate the highest and lowest interest value
print(f'Highest interest: {peaks_by_interest.iloc[0]["Interest"]}')
print(f'Lowest interest: {troughs_by_interest.iloc[0]["Interest"]}')

In [None]:
# Calculate growth rate between peaks from 2021 to 2024
growth_rate = (average_peaks.iloc[-1] - average_peaks.iloc[0]) / average_peaks.iloc[0] * 100
print(f'Percentage change in average peak interest: {round(growth_rate, 2)}%')

**Observations:**
* Interest in 'Plant detection' has shown a clear and consistent upward trend over the past three years.
* The highest peak interest of 100 was reached recently on March 17, 2024.
* Average peak interest has increased by approximately 85% from 2021 to 2024.
* Trough values have also increased substantially, indicating sustained high interest.
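
The percentage change between the first and last yearly average can be reproduced with a few lines. The values below are hypothetical and do not correspond to any keyword in this notebook. The year-over-year `pct_change()` is an addition for illustration. If the first or last year is only partially covered by the data, its average rests on fewer weeks and should be compared with some caution.

```python
import pandas as pd

# Hypothetical yearly average peak values, indexed by year
average_peaks = pd.Series({2021: 40, 2022: 48, 2023: 55, 2024: 60})

# Overall change from the first to the last year, in percent
first, last = average_peaks.iloc[0], average_peaks.iloc[-1]
growth_rate = (last - first) / first * 100
print(f"Percentage change in average peak interest: {growth_rate:.1f}%")

# Year-over-year change, in percent (the first year has no predecessor)
yearly_growth = average_peaks.pct_change() * 100
print(yearly_growth.round(1))
```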

**Conclusion:**

The consistent and fast growth in both peak and trough interest levels strongly suggests that 'Plant detection' is a high-value feature. The recent peak in interest further reinforces the feature's potential for both inclusion in the prototype and further analysis.




**Peak Detection Analysis for 'Solar powered'**

In [None]:
# Show the dataframe with peaks and troughs
peaks_throughs_dict['Solar powered'].head(3)

In [None]:
# Calculate average peaks and troughs and show dates with highest and lowest interest
average_peaks, average_troughs, peaks_by_interest, troughs_by_interest = (
    analyze_peaks_and_troughs(peaks_throughs_dict['Solar powered'])
)

In [None]:
# Calculate the highest and lowest interest value
print(f'Highest interest: {peaks_by_interest.iloc[0]["Interest"]}')
print(f'Lowest interest: {troughs_by_interest.iloc[0]["Interest"]}')

**Observations:**
* Interest in 'Solar powered' has remained relatively stable at a critically high level over the past three years, with a slight downward trend.
* The lowest average trough value observed is 64, still indicating strong interest.
* Both the highest peak (100) and lowest trough (51) occurred in 2022, suggesting fluctuations in interest during that year.
* Overall peak values remained consistently high, ranging from 72 to 83.

**Conclusion:**

The consistently high interest levels, even with slight fluctuations and a minor downward trend, strongly suggest that 'Solar powered' is a valuable feature both for inclusion in the MVP and for further analysis. The downward trend in interest should be further investigated.



**Current Interest Trend Analysis**
* Determine the current direction of interest (increasing or decreasing).
* Identify the date when the current trend began.

**Current Interest Trend for 'Plant detection'**

In [None]:
determine_current_trend(peaks_throughs_dict['Plant detection'])

**Observations:**
* Interest in this feature has increased by 6 units over the last week, reaching a current high value of 63.

**Conclusion:**
* This recent upward trend indicates a potential for further growth in interest.
* Continued monitoring and analysis are required to fully understand the nature and sustainability of this trend.

**Current Interest Trend for 'Solar powered'**

In [None]:
determine_current_trend(peaks_throughs_dict['Solar powered'])

**Observations:**
* Interest in 'Solar powered' has seen a modest increase of 2 units over the past two weeks, reaching a very high value of 89.

**Conclusion:**
* This recent uptick, while small, is significant given the previous slight decline observed over the past three years. The current high interest level suggests a potential for continued growth and a possible reversal of the downward trend.
* Further monitoring is crucial to confirm this potential shift in the interest trajectory.

<a name='features-visualizations'></a>
### Visualizations: Interest Over Time For Key Features
* Visualize weekly interest over 3 years per interest category.

<a name='features-moderate-visualizations'></a>
#### Weekly Interest for Moderate Interest Features
* Aggregate the data over monthly intervals.
* Use a heatmap to show patterns in interest over time.

In [None]:
# Show the dataframe for all features with 'Moderate' interest
moderate_interest_df.head(3)

In [None]:
# Show structure of the dataframe
moderate_interest_df.info()

In [None]:
# Import necessary libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Define a function to construct a heatmap showing monthly interest over 3 years
def create_heatmap(df, interest_level, category='Features'):
    """
    This function takes a DataFrame with a date index and multiple keyword columns,
    resamples the data to a monthly frequency, transposes it, and creates a heatmap.

    Parameters:
    df (pd.DataFrame): The input DataFrame with a date index and keyword columns.
    interest_level (str): The interest level description to be used in the heatmap title.
    category (str): The category of the interest level ('Features'/'Problems'/'Needs').

    Returns:
    None
    """
    # Ensure the date column is set as index
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("The DataFrame index must be a DatetimeIndex.")

    # Resample the data to a monthly frequency and take the mean
    df_monthly = df.resample('M').mean()

    # Transpose the DataFrame for heatmap
    df_transposed = df_monthly.T

    # Change the column labels to display month and year
    df_transposed.columns = df_transposed.columns.strftime('%b %Y')

    # Create the heatmap
    plt.figure(figsize=(12, 4), dpi=300)
    sns.heatmap(df_transposed, cmap='viridis', cbar=True, vmin=1, vmax=100)
    plt.title(f'Interest in "{interest_level}" Interest {category} Over Time')
    plt.xlabel('Date')
    plt.ylabel('Keyword')
    plt.xticks(rotation=90, ha='right', fontsize=10)
    plt.show()

In [None]:
# Return heatmap for moderate interest features
create_heatmap(moderate_interest_df, 'Moderate')

**Observations:**
* Interest in 'Food system health' (the only 'moderate' interest feature) has remained consistently low over the three years, primarily ranging between 0 and 40.
* A noticeable increase in interest has occurred since late 2023, with sustained levels above 20 units and two distinct peaks in 2024. This is visible as lighter colors in this interval, including two bars in a lighter shade of blue that indicate values between 40 and 50.
* The heatmap shows that interest has historically been low in the summer months from June to August, so a decline can also be expected in summer 2024. The bar for June 2024, which is substantially darker than the preceding bars, supports this.

**Conclusion:**

The feature 'Food system health' has consistently low interest, but the data suggests that an increase in interest is possible, while a temporary decrease in the coming summer months is likely.

<a name='features-high-visualizations'></a>
#### Weekly Interest for High Interest Features
* Aggregate the data over monthly intervals.
* Use a heatmap to show patterns in interest over time.

In [None]:
# Show high interest features
high_interest_df.head(3)

In [None]:
# Return heatmap for high interest features
create_heatmap(high_interest_df, 'High')

**Observations:**

* **'Educational value':**
  * Interest has remained above roughly 30 throughout the three-year period, highlighting a sustained baseline of interest.
  * A notable increase occurred in spring 2022, with interest rising from 60 to 90 over 5 months.
  * Interest has continued to rise since then, reaching another peak in spring 2024 and remaining high (between 60 and 80) in the current period.
* **'Remote access and control':**
  * This keyword demonstrates a notable growth trajectory over the last 3 years.
  * Starting from very low interest in 2021, it experienced a gradual and consistent increase, culminating in being the keyword with the highest interest in the current period, ranging between 50 and 90 since September 2022.

* **'Plant identification':**
  * A clear seasonal pattern is evident, with peak interest consistently occurring during the summer months of each year.
  * While the overall interest in 2024 appears slightly lower than in previous years, there's a recent increase in interest starting from April 2024, suggesting the possibility of another summer peak.

**Conclusion:**
 * 'Educational value' maintains consistently high interest, while 'Remote access and control' shows a notable upward trend, currently holding the highest interest among the analyzed keywords. 'Plant identification' demonstrates a strong seasonal pattern, possibly linked to summer in the northern hemisphere. These findings suggest prioritizing 'Remote access and control' for further analysis and inclusion in the prototype, while further investigating the potential of 'Educational value' and the seasonal nature of 'Plant identification'.






<a name='features-critical-visualizations'></a>
#### Visualize Interest for Critical Interest Features Over Time
* Aggregate the data over monthly intervals.
* Use a heatmap to show patterns in interest over time.

In [None]:
# Show critical interest features
critical_interest_df.head(3)

In [None]:
# Redefine the function that constructs a heatmap of monthly interest over 3 years, now with a taller figure (figsize=(12, 8))
def create_heatmap(df, interest_level, category='Features'):
    """
    This function takes a DataFrame with a date index and multiple keyword columns,
    resamples the data to a monthly frequency, transposes it, and creates a heatmap.

    Parameters:
    df (pd.DataFrame): The input DataFrame with a date index and keyword columns.
    interest_level (str): The interest level description to be used in the heatmap title.
    category (str): The category of the interest level ('Features'/'Problems'/'Needs').

    Returns:
    None
    """
    # Ensure the date column is set as index
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("The DataFrame index must be a DatetimeIndex.")

    # Resample the data to a monthly frequency and take the mean
    df_monthly = df.resample('M').mean()

    # Transpose the DataFrame for heatmap
    df_transposed = df_monthly.T

    # Change the column labels to display month and year
    df_transposed.columns = df_transposed.columns.strftime('%b %Y')

    # Create the heatmap
    plt.figure(figsize=(12, 8))
    sns.heatmap(df_transposed, cmap='viridis', cbar=True, vmin=1, vmax=100)
    plt.title(f'Interest in "{interest_level}" Interest {category} Over Time')
    plt.xlabel('Date')
    plt.ylabel('Keyword')
    plt.xticks(rotation=90, ha='right', fontsize=10)
    plt.show()

In [None]:
# Return heatmap for critical interest features
create_heatmap(critical_interest_df, 'Critical')

**Interpretation of Results:**

* **'Plant detection':**
  * Interest has increased consistently over the last three years, moving from moderate levels (40-60) in the first half of 2021 to high levels (60-100) since January 2022, and it has continued to rise since then.
  * There have been two notable decreases: one in the summer of 2022 (50) and a recent one in June 2024 (70).
  * Overall, the trend is upward, with no strong evidence of seasonality.
* **'Solar powered':**
  * Interest has remained consistently high throughout the three years, rarely dipping below 50.
  * The data suggests seasonality, with higher interest during summer months.
  * While a slight decline was observed in the summer of 2023, interest has rebounded and increased since February 2024 to a current value of 90.

**Conclusion:**

The steady increase in interest for 'Plant detection' over the past three years, despite minor fluctuations, highlights its potential as a valuable feature to consider for the MVP. The consistently high interest in 'Solar powered', with clear seasonal peaks in summer, suggests long-term interest in this feature and makes it a strong candidate for the prototype as well. Both features are valuable for further analysis, which should include an investigation of the seasonality in interest for 'Solar powered'.
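One simple way to start investigating seasonality is to pool all years and average interest per calendar month. The sketch below is illustrative only: `monthly_profile` is a hypothetical helper, and the data is synthetic with a summer bump built in, so it says nothing about the actual 'Solar powered' series.

```python
import pandas as pd


def monthly_profile(interest: pd.Series) -> pd.Series:
    """Mean interest per calendar month (1-12), pooled across all years."""
    return interest.groupby(interest.index.month).mean()


# Synthetic weekly data with a built-in summer bump (illustrative only)
dates = pd.date_range('2021-06-06', '2024-06-02', freq='W')
values = [80 if d.month in (6, 7, 8) else 60 for d in dates]
profile = monthly_profile(pd.Series(values, index=dates))
print(profile.round(1))
```

A month whose pooled mean is clearly above the others in every year is a candidate seasonal peak; with only three years of data, such a profile is a hint rather than proof.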





<a name='key-problems-analysis'></a>
## Key Problems Analysis
- Identify and evaluate the proposed problems most relevant to potential customers.
- Analyze trends in interest for most relevant problems.
- Visualize interest over time for the key problems.

<a name='most-relevant-problems'></a>
### Determine Problems with Highest Relevance
- Present problems sorted by their relevance.
- Determine problems with significant relevance.
- Establish thresholds to classify problems into the following categories:
  - Critical: Essential and highly relevant problems.
    - Average weekly interest: 75% - 100%
  - High: Very important and significantly relevant problems.
    - Average weekly interest: 50% - 75%
  - Moderate: Important but not essential problems.
    - Average weekly interest: 10% - 50%
  - Low: Problems of little relevance - will be excluded from further analysis.
    - Average weekly interest: 0% - 10%
- The percentages used to define the interest categories are relative to the highest observed average weekly interest value in the dataset.
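The thresholds above could be applied roughly as follows. This is a sketch under assumptions, not the notebook's `get_relative_interest`: the helper name `classify_relative_interest` is hypothetical, the column names `weekly_average_interest` and `relevance` mirror those used in the cells below, and the handling of values that fall exactly on a threshold is a guess.

```python
import pandas as pd


def classify_relative_interest(df: pd.DataFrame,
                               column: str = 'weekly_average_interest') -> pd.DataFrame:
    """Add a 'relevance' column based on interest relative to the dataset maximum."""
    result = df.copy()
    # Express each keyword's interest as a percentage of the highest value
    relative = result[column] / result[column].max() * 100
    # Bin edges follow the thresholds above; a value exactly on an edge falls
    # into the lower category (an assumption, the text leaves this open)
    result['relevance'] = pd.cut(
        relative,
        bins=[0, 10, 50, 75, 100],
        labels=['Low', 'Moderate', 'High', 'Critical'],
        include_lowest=True,
    )
    return result


# Illustrative values only
example = pd.DataFrame(
    {'weekly_average_interest': [72.0, 48.0, 14.7, 3.0]},
    index=['keyword a', 'keyword b', 'keyword c', 'keyword d'],
)
classified = classify_relative_interest(example)
print(classified)
```

Because the categories are relative to the dataset maximum, the top keyword of any dataset is always 'Critical', even when its absolute interest is low.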



In [None]:
# Show problems sorted by relevance dataframe
problems_by_relevance.head(3)

In [None]:
# Classify problem keywords into interest categories
problems_by_relevance = get_relative_interest(problems_by_relevance)
problems_by_relevance.head(3)

In [None]:
# Calculate percentage of low interest problems
low_relevance_percentage = round((problems_by_relevance['relevance'] == 'Low').mean() * 100)
print(f'Percentage of low interest problems: {low_relevance_percentage}%')

In [None]:
# Compare the highest relevance problem with the highest relevance feature
highest_relevance_problem = problems_by_relevance.loc[problems_by_relevance['relevance'] == 'Critical'].head(1)
highest_relevance_feature = features_by_interest.loc[features_by_interest['relevance'] == 'Critical'].head(1)

# Get the highest interest values
problem_interest = highest_relevance_problem['weekly_average_interest'].values[0]
feature_interest = highest_relevance_feature['weekly_average_interest'].values[0]

# Calculate difference in interest between the highest relevance feature and problem
difference = feature_interest - problem_interest

# Express the difference as a percentage of the problem's interest
percentage_difference = (difference / problem_interest) * 100

# The base of the percentage is the problem, so the result states how much higher the feature is
print(f'The most relevant feature is {round(percentage_difference)}% more relevant than the most relevant problem')


In [None]:
# Show highest relevance problem
highest_relevance_problem

In [None]:
# Show highest relevance feature
highest_relevance_feature

In [None]:
# Show moderate interest features
features_by_interest[features_by_interest['relevance']=='Moderate']

#### Results
* 91% of Proposed Problems Have Low Relevance:
  * Only one proposed problem has an average weekly interest exceeding 10%. The remaining 91% of problems average below 10%, which indicates low demand for solutions to them. The prototype should not prioritize these problems, and they will be excluded from further analysis.
* Critical Relevance:
  * 'Monoculture in agriculture' is the only problem with weekly interest exceeding 10% and therefore the only relevant one. It should be the primary problem addressed in prototype development.
  * Although 'Monoculture in agriculture' is the most relevant problem, the interest level of the top-ranked feature, 'Solar powered', is 390% higher. The problem's interest is still moderate in absolute terms, comparable to the feature 'Food system health', which received a similar weekly interest score of 14.73.
  * Because it is classified as critical within the problem domain and shows a moderate level of interest, the trend for 'Monoculture in agriculture' will be analyzed further to determine whether interest in this topic is growing, stable, or declining.
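The percentage computed above divides the difference by the problem's interest, so it describes how far the feature sits above the problem. A value can never be more than 100% lower than another positive value, so the reverse statement needs a conversion. The helper below is a hypothetical illustration of that arithmetic, using only the 390% figure reported above.

```python
def percent_higher_to_percent_lower(percent_higher: float) -> float:
    """Convert 'A is X% higher than B' into 'B is Y% lower than A'."""
    ratio = 1 + percent_higher / 100  # A divided by B
    return (1 - 1 / ratio) * 100      # shortfall of B as a share of A


percent_lower = percent_higher_to_percent_lower(390)
print(f'{percent_lower:.1f}% lower')
```

In other words, a feature that is 390% higher than the problem corresponds to a problem whose interest is roughly 80% lower than the feature's.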

<a name='key-problems-trend-analysis'></a>
### Detailed Trend Analysis of Key Problem(s)
- Analyze trends in interest for problems with moderate to critical relevance
  - Calculate and interpret summary statistics
    - Count
    - Mean
    - Standard Deviation
    - Minimum Interest
    - Maximum Interest
    - Quartiles
    - Variance (inferred)
  - Identify Trends
    - Determine if interest over the last 3 years is increasing, decreasing or stable.
    - Determine peaks and troughs in interest.
    - Determine if interest is currently rising or falling.

<a name='trends-critical-problems'></a>
#### Trend Analysis for Problems with Critical Interest
- Assess whether interest has been increasing, decreasing, or stable over the past 3 years.
- Identify the peaks and troughs in interest levels.
- Determine the current trend in interest, whether it is rising or falling.

**Summary Statistics**
* Evaluate summary statistics to get an overview of the data's characteristics.

In [None]:
# Show all problems with critical relevance
critical_relevance_df = get_relevant_keywords_df(problems_by_relevance, problems_relevance_over_time, 'Critical')
critical_relevance_df.head(3)

In [None]:
# Show summary stats for critical relevance problems
summary_stats_critical_relevance = critical_relevance_df.describe().round().astype(int)
summary_stats_critical_relevance

**Observations:**
* Interest in **'Monoculture in Agriculture'** is highly variable, with a mean value of 15 and a standard deviation of 28.
* The lowest recorded interest is 0, with all quartiles also at 0. This indicates that interest is absent for at least 75% of the observed weeks.
* Despite the typically low interest, the keyword has reached a maximum interest level of 100, suggesting occasional spikes of intense interest.

**Conclusion:**

The data reveals a pattern of near-zero interest in 'Monoculture in Agriculture', interrupted by infrequent but intense spikes in interest. This high variability, with a majority of weeks showing no interest, suggests that the mean value is skewed upwards by these short-lived peaks. Further analysis is crucial to understand the timing and potential causes of these spikes, as well as the overall trend in interest over time.
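The effect of rare spikes on the mean can be seen with a small synthetic series. The values below are invented for illustration and `spike_summary` is a hypothetical helper; the point is only that a few high weeks pull the mean well above the median when most weeks are zero.

```python
import pandas as pd


def spike_summary(interest: pd.Series) -> dict:
    """Compare mean and median and report the share of weeks with zero interest."""
    return {
        'mean': float(interest.mean()),
        'median': float(interest.median()),
        'share_of_zero_weeks': float((interest == 0).mean()),
    }


# Illustrative values only: eight weeks without interest and two spikes
example = pd.Series([0, 0, 0, 0, 60, 0, 0, 100, 0, 0])
summary = spike_summary(example)
print(summary)
```

For series like this, the median and the share of zero weeks describe a typical week better than the mean does.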

**Linear Regression Trend Analysis**
* Assess change in interest over time using a statistical model.
* Evaluate suitability of keywords for further analysis and prototype development based on general trend over the last 3 years.

In [None]:
# Get interest trends for critical interest problems
analyze_interest_trends(critical_relevance_df, slope_decimals=4)


**Observations:**
* The linear regression model indicates a slight positive slope of 0.0047 for the keyword **'Monoculture in agriculture'**.
* The p-value of 0.5062 suggests that this slope is not statistically significant, meaning there's not enough evidence to confidently say there's a true upward trend over time.

**Conclusion:**

The trend analysis for 'Monoculture in agriculture' does not reveal a statistically significant change in interest over the past three years. Due to the high variability in the data, further investigation using alternative methods or a larger dataset is needed to definitively determine if a true trend exists.

**Trend Analysis Using Peak Detection**
* Determine peaks and troughs in interest.
* Determine the average yearly magnitude of peaks and troughs and interpret results.
* Evaluate the yearly variability in interest.
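A minimal sketch of these steps is shown below. It is not the notebook's `get_peaks_and_troughs` or `analyze_peaks_and_troughs`, whose implementations are not shown here. The helper names, the `Peak` and `Trough` column names, and the neighbour-comparison rule are assumptions chosen for illustration.

```python
import pandas as pd


def mark_peaks_and_troughs(interest: pd.Series) -> pd.DataFrame:
    """Flag strict local maxima and minima by comparing each week with its neighbours."""
    previous, following = interest.shift(1), interest.shift(-1)
    return pd.DataFrame({
        'Interest': interest,
        # The first and last observations have a missing neighbour (NaN),
        # so the comparisons are False and they are never flagged.
        # Plateaus (equal neighbours) are not flagged either.
        'Peak': (interest > previous) & (interest > following),
        'Trough': (interest < previous) & (interest < following),
    })


def yearly_average(marked: pd.DataFrame, flag: str) -> pd.Series:
    """Average interest of the flagged weeks ('Peak' or 'Trough') per calendar year."""
    flagged = marked.loc[marked[flag], 'Interest']
    return flagged.groupby(flagged.index.year).mean()


# Illustrative values only
dates = pd.date_range('2021-01-03', periods=7, freq='W')
marked = mark_peaks_and_troughs(pd.Series([1, 3, 2, 5, 4, 0, 2], index=dates))
example_peaks = yearly_average(marked, 'Peak')
example_troughs = yearly_average(marked, 'Trough')
print(example_peaks, example_troughs, sep='\n')
```

Comparing the yearly averages of peaks and troughs separately shows whether the ceiling, the baseline, or both are moving.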

In [None]:
# Calculate peaks and troughs in critical interest problems
peaks_throughs_dict = get_peaks_and_troughs(critical_relevance_df)
peaks_throughs_dict.keys()

**Peak Detection Analysis for 'Monoculture in agriculture'**

In [None]:
# Show the dataframe
peaks_throughs_dict['Monoculture in agriculture'].head(3)

In [None]:
# Calculate average peaks and troughs and show dates with highest and lowest interest
average_peaks, average_troughs, peaks_by_interest, troughs_by_interest = (
    analyze_peaks_and_troughs(peaks_throughs_dict['Monoculture in agriculture'])
)

In [None]:
# Calculate growth rate between peaks from 2021 to 2024
growth_rate = (average_peaks.iloc[-1] - average_peaks.iloc[0]) / average_peaks.iloc[0] * 100
print(f'Percentage change in average peak interest: {round(growth_rate, 2)}%')

**Observations:**
* Interest in 'Monoculture in Agriculture' demonstrates high variability, fluctuating between 0 and 100 over the past three years.
* Average peak interest has decreased slightly by 7.9% from 2021 to 2024, which is consistent with the absence of an upward trend in interest.
* The highest and lowest interest dates in 2021 are close together, suggesting that peaks in interest are sporadic rather than consistent.
* Average trough interest remains consistently low, often reaching zero, indicating periods of minimal interest.

**Conclusion:**

Interest in 'Monoculture in Agriculture' is highly variable, with sporadic peaks that are likely influenced by external factors. A slight decline in average peak interest has been observed, but given the variability it may be due to random fluctuations. The relevance of this problem for prototype design requires further analysis, such as an analysis of seasonal patterns or an analysis using a larger dataset.

**Current Interest Trend Analysis**
* Determine the current direction of interest (increasing or decreasing).
* Identify the date when the current trend began.

**Current Interest Trend for 'Monoculture in agriculture'**

In [None]:
# Determine if interest is currently increasing or decreasing
determine_current_trend(peaks_throughs_dict['Monoculture in agriculture'])

**Observations:**
* Interest in 'Monoculture in agriculture' has increased from 0 to 17 units over the past two weeks.

**Conclusion:**

This recent spike in interest aligns with the historically volatile nature of this topic. However, it could also signal a potential shift towards more sustained interest in 'Monoculture in Agriculture'. Further monitoring and analysis are crucial to determine whether this increase is a temporary surge or the beginning of a longer-term trend.

<a name='problems-visualizations'></a>
### Visualizations: Interest Over Time For Key Problems
* Aggregate the data over monthly intervals.
* Use a heatmap to show patterns in interest over time.

<a name='problems-critical-visualizations'></a>
#### Visualization for Critical Interest Problems
* Visualize weekly interest over 3 years

In [None]:
# Show data to visualize
critical_relevance_df.head(3)

In [None]:
# Create a heatmap to show interest over time
create_heatmap(critical_relevance_df, 'Critical', category='Problems')

**Observations:**

* Interest in 'Monoculture in agriculture' is highly variable, with most months showing little to no interest.
* Distinct peaks of high interest typically occur once a year in late winter or spring, particularly in February.

**Conclusion:**

The timing of peak interest in 'Monoculture in agriculture' may be related to the planting season in the northern hemisphere, although further analysis is needed to confirm this hypothesis. Given that 'Monoculture in agriculture' received the highest interest among the suggested problems, it will be considered for further analysis despite its inconsistent interest, which is low in absolute terms. However, to ensure that more relevant problems are addressed by the prototype, additional data should be gathered on alternative keywords.

<a name='key-needs-analysis'></a>
## Key Needs Analysis
- Identify and evaluate high priority needs of potential customers.
- Analyze trends in interest for most relevant needs.
- Visualize interest over time for the key needs.

<a name='most-relevant-needs'></a>
### Determine Needs with Highest Relevance
- Present needs sorted by their relevance.
- Determine needs with significant relevance.
- Establish thresholds to classify needs into the following categories:
  - Critical: Essential and highly relevant needs.
    - Average weekly interest: 75% - 100%
  - High: Very important and significantly relevant needs.
    - Average weekly interest: 50% - 75%
  - Moderate: Important but not essential needs.
    - Average weekly interest: 10% - 50%
  - Low: Needs of little relevance - will be excluded from further analysis.
    - Average weekly interest: 0% - 10%
- The percentages used to define the interest categories are relative to the highest observed average weekly interest value in the dataset.



In [None]:
# Show needs sorted by relevance dataframe
needs_by_relevance.head(3)

In [None]:
# Add a categorical column splitting keywords into categories
needs_by_relevance = get_relative_interest(needs_by_relevance)
needs_by_relevance.head(4)

In [None]:
# Calculate percentage of low interest needs
low_relevance_percentage = round((needs_by_relevance['relevance'] == 'Low').mean() * 100)
print(f'Percentage of low relevance needs: {low_relevance_percentage}%')

In [None]:
# Show moderate interest features
features_by_interest[features_by_interest['relevance']=='Moderate']

In [None]:
# Show high interest features
features_by_interest[features_by_interest['relevance']=='High']

In [None]:
# Show critical interest problems
problems_by_relevance[problems_by_relevance['relevance']=='Critical']

#### Results
* 86% of Proposed Needs Have Low Relevance:
  * The majority of proposed needs (86%) show low relevance, with weekly interest levels falling below 10%.
  * These low relevance needs will not be considered for further analysis.
* Moderate Relevance:
  * 'Sustainable food production' is the only need identified as moderately relevant, with an average weekly interest of 25 units.
  * Further trend analysis is necessary to determine if this need justifies inclusion in the prototype development.
* High Relevance:
  * Despite its high relevance classification, the interest level of 29 units for 'Educational AI' is only slightly higher than that of the moderately relevant need 'Sustainable food production'.
  * Interest is substantially lower than that of 'Remote access and control' (48), the high interest feature with the lowest value.
  * Further trend analysis will determine this keyword's relevance for further analysis and prototype development.
* Critical Relevance:
  * 'Climate resilience' stands out with a critical average weekly interest of 58 units, indicating strong interest. Although this level would only be considered 'high' relative to feature keywords, it remains a top priority for further analysis and potential inclusion in the prototype.

<a name='key-needs-trend-analysis'></a>
### Detailed Trend Analysis of Key Needs
- Analyze trends in interest for needs with moderate to critical relevance
  - Calculate and interpret summary statistics
    - Count
    - Mean
    - Standard Deviation
    - Minimum Interest
    - Maximum Interest
    - Quartiles
    - Variance (inferred)
  - Identify Trends
    - Determine if interest over the last 3 years is increasing, decreasing or stable.
    - Determine peaks and troughs in interest.
    - Determine if interest is currently rising or falling.

<a name='trends-moderate-needs'></a>
#### Trend Analysis for Needs with Moderate Interest
- Assess whether interest has been increasing, decreasing, or stable over the past 3 years.
- Identify the peaks and troughs in interest levels.
- Determine the current trend in interest, whether it is rising or falling.

**Summary Statistics**
* Evaluate summary statistics to get an overview of the data's characteristics.


In [None]:
# Show all needs with moderate relevance
moderate_relevance_df = get_relevant_keywords_df(needs_by_relevance, needs_relevance_over_time, 'Moderate')
moderate_relevance_df.head(3)

In [None]:
# Get summary stats
summary_stats_moderate_relevance = moderate_relevance_df.describe().round().astype(int)
summary_stats_moderate_relevance

**Observations:**
- The data for **'Sustainable food production'** reveals a relatively stable interest level over the last three years, with an average weekly interest of 25 units.
- The standard deviation of 10 indicates moderate variability compared to other analyzed keywords, indicating stable interest.
- The lowest recorded interest is 0, meaning there was at least one week with no interest in the topic.
- A maximum interest level of 48 suggests consistently moderate interest in this feature.
- Interest ranges from 0 to 48, suggesting fluctuations over time, but the majority of data points cluster between 18 and 33, indicating consistent moderate interest.
* The maximum interest of 48 suggests occasional peaks of interest, although these are not as extreme as in other keywords.

**Conclusion:**

Overall, 'Sustainable food production' shows consistent moderate interest with some variability. The stability of interest suggests sustained attention to this subject, making it a potentially valuable area for further exploration and potential inclusion in prototype development.

**Linear Regression Trend Analysis**
* Assess change in interest over time using a statistical model.
* Evaluate suitability of keywords for further analysis and prototype development based on general trend over the last 3 years.

In [None]:
# Get interest trends for moderate interest needs
analyze_interest_trends(moderate_relevance_df, slope_decimals=4)

**Observations:**
* Interest in the proposed need **'Sustainable food production'** is increasing moderately at a yearly rate of ~5.85 units (more than 5 points per year on the 0-100 interest scale).
* The trend is statistically significant (p-value < 0.05).

**Conclusion:**

The moderate increase in interest in 'Sustainable food production' shows that this need is becoming more relevant over time, which makes it suitable for further analysis and inclusion in the prototype design.

**Trend Analysis Using Peak Detection**

In [None]:
# Calculate peaks and troughs in moderate interest needs
peaks_throughs_dict = get_peaks_and_troughs(moderate_relevance_df)
peaks_throughs_dict.keys()

**Peak Detection Analysis for 'Sustainable food production'**

In [None]:
# Show dataframe with boolean columns for peaks and troughs
peaks_throughs_dict['Sustainable food production'].head(3)

In [None]:
# Calculate average peaks and troughs and show dates with highest and lowest interest
average_peaks, average_troughs, peaks_by_interest, troughs_by_interest = (
    analyze_peaks_and_troughs(peaks_throughs_dict['Sustainable food production'])
)

In [None]:
# Calculate growth rate between peaks from 2021 to 2024
growth_rate = (average_peaks.iloc[-1] - average_peaks.iloc[0]) / average_peaks.iloc[0] * 100
print(f'Percentage change in average peak interest: {round(growth_rate, 2)}%')

In [None]:
# Calculate the highest and lowest interest value
print(f'Highest interest: {peaks_by_interest.iloc[0]["Interest"]}')
print(f'Lowest interest: {troughs_by_interest.iloc[0]["Interest"]}')

**Observations:**
* Interest in 'Sustainable food production' has shown a clear upward trend over the past three years, with the highest peak (48) occurring recently on April 28, 2024.
* The lowest interest (0) was recorded on July 18, 2021.
* Average peak interest has more than doubled from 19 in 2021 to 40 in 2024.
* Average trough interest has also increased steadily, from 8 in 2021 to 31 in 2024.

**Conclusion:**

The consistent growth in both peak and trough interest levels demonstrates a sustained and growing interest in 'Sustainable food production'. The recent peak interest further reinforces the relevance of this need. This makes it a strong candidate for inclusion in further analysis.


**Current Interest Trend Analysis**
* Determine the current direction of interest (increasing or decreasing).
* Identify the date when the current trend began.

**Current Interest Trend Analysis for 'Sustainable food production'**

In [None]:
determine_current_trend(peaks_throughs_dict['Sustainable food production'])

**Observations:**
* Interest in 'Sustainable food production' has increased by 4 units over the last week, reaching a current moderate value of 25.

**Conclusion:**
* This recent increase is consistent with the overall upward trend observed in the data over the past three years, suggesting a continued and gradual growth in the relevance of this need.
* Continued monitoring and analysis are required to fully understand the nature of this trend.

<a name='trends-high-needs'></a>
#### Trend Analysis for Needs with High Interest
- Assess whether interest has been increasing, decreasing, or stable over the past 3 years.
- Identify the peaks and troughs in interest levels.
- Determine the current trend in interest, whether it is rising or falling.

In [None]:
# Show all needs with high relevance
high_relevance_df = get_relevant_keywords_df(needs_by_relevance, needs_relevance_over_time, 'High')
high_relevance_df.head(3)

**Summary Statistics**
* Evaluate summary statistics to get an overview of the data's characteristics.

In [None]:
# Get summary stats
summary_stats_high_relevance = high_relevance_df.describe().round().astype(int)
summary_stats_high_relevance

**Observations:**
- The data for **'Educational AI'** reveals a highly volatile interest pattern over the past three years, with an average weekly interest of 29 units.  
- The standard deviation of 32, with values spanning from 0 to 100, indicates a wide range of interest levels. This volatility is further supported by the fact that the lowest 25% of values are zero, while the top 25% are above 53.
- A median (14) considerably lower than the mean (29) suggests that the distribution is skewed to the right. This suggests that interest is typically low, but occasionally spikes to very high levels.

**Conclusion:**

Overall, interest in 'Educational AI' is sporadic, characterized by periods of very low or no interest and bursts of high interest. Further trend analysis is needed to determine if any patterns exist within this volatility and to assess the overall trajectory of interest in this topic over time.

#### Trend Analysis High Interest Needs
* Assess whether interest is increasing, decreasing, or stable over the past 3 years.
* Assess peaks and troughs in interest levels.
* Determine current trend in interest.

**Linear Regression Trend Analysis**
* Assess change in interest over time using a statistical model.
* Evaluate suitability of keywords for further analysis and prototype development based on general trend over the last 3 years.

In [None]:
# Perform linear regression analysis
analyze_interest_trends(high_relevance_df, slope_decimals=4)

**Observations:**
- The positive slope of 0.0927 signifies an upward trend in interest over time, with a substantial yearly increase in interest of ~33.84 units.
- The very small p-value (1.09e-62) is far below the typical threshold of 0.05 for statistical significance, indicating that the observed increase in interest is very unlikely to be due to chance.

**Conclusion:**

**'Educational AI'** is a prime candidate for consideration in the design process of the prototype and for further analysis, due to the rapid increase in interest over the last 3 years.
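The reported yearly increase is consistent with a slope measured per day and multiplied by 365 (0.0927 × 365 ≈ 33.84). This is an inference from the numbers, since the implementation of `analyze_interest_trends` is not shown here. The sketch below illustrates that conversion with a hypothetical helper, `yearly_trend`, on synthetic data. It uses `np.polyfit`, which returns only the fitted coefficients; a p-value would need a different tool, for example `scipy.stats.linregress`.

```python
import numpy as np
import pandas as pd


def yearly_trend(interest: pd.Series) -> tuple:
    """Fit a straight line to interest over time; return (slope per day, slope per year)."""
    # Days elapsed since the first observation serve as the x-axis
    days = (interest.index - interest.index[0]).days.to_numpy(dtype=float)
    slope_per_day, _intercept = np.polyfit(days, interest.to_numpy(dtype=float), deg=1)
    return float(slope_per_day), float(slope_per_day * 365)


# Synthetic weekly series that rises by exactly 0.1 units per day (illustrative only)
dates = pd.date_range('2021-06-06', periods=157, freq='W')
elapsed_days = (dates - dates[0]).days.to_numpy()
synthetic = pd.Series(0.1 * elapsed_days, index=dates)
per_day, per_year = yearly_trend(synthetic)
print(round(per_day, 4), round(per_year, 2))
```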

**Trend Analysis Using Peak Detection**

In [None]:
# Calculate peaks and troughs in high interest needs
peaks_throughs_dict = get_peaks_and_troughs(high_relevance_df)
peaks_throughs_dict.keys()

**Peak Detection Analysis for 'Educational AI'**

In [None]:
# Calculate average peaks and troughs and show dates with highest and lowest interest
average_peaks, average_troughs, peaks_by_interest, troughs_by_interest = (
    analyze_peaks_and_troughs(peaks_throughs_dict['Educational AI'])
)

In [None]:
# Calculate the highest and lowest interest value
print(f'Highest interest: {peaks_by_interest.iloc[0]["Interest"]}')
print(f'Lowest interest: {troughs_by_interest.iloc[0]["Interest"]}')

In [None]:
# Calculate growth rate between peaks from 2021 to 2024
growth_rate = (average_peaks.iloc[-1] - average_peaks.iloc[0]) / average_peaks.iloc[0] * 100
print(f'Percentage change in average peak interest: {round(growth_rate, 2)}%')

**Observations:**
* Interest in 'Educational AI' shows a steep and consistent upward trend over the past three years, with the highest peak (100) occurring recently on May 19, 2024.
* The average peak interest has increased dramatically, growing 6.5-fold from 14 in 2021 to 91 in 2024 (a 550% increase).
* The increasing average trough values, particularly the notable rise to 80 in 2024, suggest that baseline interest in 'Educational AI' is also growing rapidly.
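
A percentage increase and a fold change describe the same growth in different ways, which is easy to mix up. The following minimal sketch uses the average peak values quoted above (14 and 91); `growth_summary` is a hypothetical helper name.

```python
def growth_summary(start, end):
    """Return (percentage increase, fold change) between two values."""
    pct_increase = (end - start) / start * 100
    fold_change = end / start
    return pct_increase, fold_change


pct, fold = growth_summary(14, 91)
print(f"{pct:.0f}% increase, i.e. {fold:.1f}x the starting value")
```

An increase of 550% therefore corresponds to 6.5 times the starting value, not 5.5 times.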

**Conclusion:**

The rapid and consistent growth in both peak and trough interest levels, coupled with the fact that interest is currently at its highest average peak value, strongly indicates that 'Educational AI' is a high-priority need. This need is a prime candidate for inclusion in the prototype and further analysis.



**Current Interest Trend Analysis**
* Determine the current direction of interest (increasing or decreasing).
* Identify the date when the current trend began.
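
One way to express these two steps is to walk backwards from the latest value for as long as the week-over-week change keeps the same sign. The sketch below is an illustration under that assumption; `current_trend` and the sample values are hypothetical and are not the notebook's `determine_current_trend`.

```python
def current_trend(values):
    """Describe the run of same-direction moves that ends at the last value.

    Returns (direction, start_index, change), where start_index is the
    position at which the current run began.
    """
    if len(values) < 2:
        return "flat", len(values) - 1, 0
    last_step = values[-1] - values[-2]
    if last_step == 0:
        return "flat", len(values) - 1, 0
    sign = 1 if last_step > 0 else -1
    start = len(values) - 2
    # Walk backwards while each earlier step keeps the same sign.
    while start > 0 and (values[start] - values[start - 1]) * sign > 0:
        start -= 1
    direction = "increasing" if sign > 0 else "decreasing"
    return direction, start, values[-1] - values[start]


weekly = [70, 74, 68, 73, 78]  # made-up weekly interest values
direction, start_index, change = current_trend(weekly)
print(direction, start_index, change)
```

With dated observations, `start_index` can be mapped back to the corresponding date to report when the current trend began.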

**Current Interest Trend Analysis for 'Educational AI'**

In [None]:
# Check if the current value is higher than the last trough value
determine_current_trend(peaks_throughs_dict['Educational AI'])

**Observation:**
* Interest in 'Educational AI' has seen a moderate increase of 10 units over the past two weeks, reaching a critically high value of 78.

**Conclusion:**

The current trend is consistent with the steep upward trend observed over the last 3 years. Further investigation is required to determine the significance of this trend.

<a name='trends-critical-needs'></a>
#### Trend Analysis for Needs with Critical Interest
- Assess whether interest has been increasing, decreasing, or stable over the past 3 years.
- Identify the peaks and troughs in interest levels.
- Determine the current trend in interest, whether it is rising or falling.

In [None]:
# Show all needs with critical relevance
critical_relevance_df_needs = (
    get_relevant_keywords_df(needs_by_relevance, needs_relevance_over_time, 'Critical')
)
critical_relevance_df_needs.head(3)

**Summary Statistics**
* Evaluate summary statistics to get an overview of the data's characteristics.

In [None]:
# Get summary stats
summary_stats_critical_relevance = critical_relevance_df_needs.describe().round().astype(int)
summary_stats_critical_relevance

**Observations:**
* Weekly interest in 'Climate resilience' averages 56 units over the past three years, indicating consistent moderate to high interest.
* The standard deviation of 18, with values ranging from 17 to 100, suggests a moderate range of interest levels.
* The median value is 56, equal to the mean, indicating a relatively symmetrical distribution. A small number of exceptionally high values, up to the maximum of 100, add a slight right tail.

**Conclusion:**

Overall, interest in 'Climate resilience' has been stable and moderately high over the past three years, with occasional spikes in interest. The distribution is generally symmetrical, reflecting consistent interest with some variability due to peak interest periods.

**Linear Regression Trend Analysis**
* Assess change in interest over time using a statistical model.
* Evaluate suitability of keywords for further analysis and prototype development based on general trend over the last 3 years.

In [None]:
# Get linear regression results
analyze_interest_trends(critical_relevance_df_needs, slope_decimals=4)

**Observations:**
* The positive slope of 0.0494 indicates an upward trend in interest over time, corresponding to a significant yearly increase of approximately 18.02 units.
* The extremely small p-value (2.70e-46) is well below 0.05, confirming that the observed increase in interest is statistically significant and not due to chance.

**Conclusion:**

The significant upward trend in interest over the past three years makes 'Climate resilience' a prime candidate for the prototype design process and for inclusion in further analysis.

**Trend Analysis Using Peak Detection**

In [None]:
# Calculate peaks and troughs in critical interest needs
peaks_throughs_dict = get_peaks_and_troughs(critical_relevance_df_needs)
peaks_throughs_dict.keys()

**Peak Detection Analysis for 'Climate resilience'**

In [None]:
# Calculate average yearly peaks and troughs and show dates with highest and lowest interest
average_peaks, average_troughs, peaks_by_interest, troughs_by_interest = (
    analyze_peaks_and_troughs(peaks_throughs_dict['Climate resilience'])
)

In [None]:
# Calculate the highest and lowest interest value
print(f'Highest interest: {peaks_by_interest.iloc[0]["Interest"]}')
print(f'Lowest interest: {troughs_by_interest.iloc[0]["Interest"]}')

In [None]:
# Calculate growth rate between peaks from 2021 to 2024
growth_rate = (average_peaks.iloc[-1] - average_peaks.iloc[0]) / average_peaks.iloc[0] * 100
print(f'Percentage change in average peak interest: {round(growth_rate, 2)}%')

**Observations:**
* The interest in 'Climate resilience' has shown a consistent upward trend over the past three years, peaking at 100 on April 21, 2024, and dipping to its lowest at 17 on December 26, 2021.
* This trend reflects a substantial growth of approximately 147.22%, indicating that interest in 'Climate resilience' has more than doubled since 2021.
* Although 'Educational AI' has a higher growth rate (550%) than 'Climate resilience' (147%) over the last three years, 'Climate resilience' shows a higher baseline interest, never dropping below 17 units. This sustained interest suggests a more mature and established relevance compared to the emerging interest in 'Educational AI', which started from 0.
* The increasing average trough values, particularly in 2024, suggest that overall interest in 'Climate resilience' is growing rapidly, not just the peaks.

**Conclusion:**

The average yearly peak values for 'Climate resilience' have more than doubled over the past three years, demonstrating a substantial and consistent increase. This trend, along with the relatively high baseline interest, makes 'Climate resilience' a prime candidate for inclusion in prototype design and further analysis.


**Current Interest Trend Analysis**
* Determine the current direction of interest (increasing or decreasing).
* Identify the date when the current trend began.

**Current Interest Trend Analysis for 'Climate resilience'**

In [None]:
# Check if the current value is higher than the last trough value
determine_current_trend(peaks_throughs_dict['Climate resilience'])

**Observations:**
* Interest in 'Climate resilience' has increased by 8 units over the past two weeks, reaching a current value of 88 as of 06/23/2024.
* Compared to other needs, 'Climate resilience' currently has the highest relevance.

**Conclusion:**

The recent steady increase in interest for 'Climate resilience' aligns with the upward trend observed over the last three years. This sustained growth highlights the need's growing relevance and justifies its inclusion in the prototype design process and further analysis.

<a name='needs-visualizations'></a>
### Visualizations: Interest Over Time For Key Needs
* Aggregate the data over monthly intervals.
* Use a heatmap to show patterns in interest over time.
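
The monthly aggregation step can be pictured as averaging weekly values into a year-by-month grid, which is the shape a heatmap needs. The sketch below uses only the standard library and synthetic data; `monthly_grid` is a hypothetical helper and not the notebook's `create_heatmap`.

```python
from collections import defaultdict
from datetime import date, timedelta


def monthly_grid(observations):
    """Average (date, value) pairs into a {year: {month: mean}} grid."""
    buckets = defaultdict(list)
    for day, value in observations:
        buckets[(day.year, day.month)].append(value)
    grid = defaultdict(dict)
    for (year, month), values in sorted(buckets.items()):
        grid[year][month] = sum(values) / len(values)
    return dict(grid)


# Synthetic weekly observations: eight Sundays starting 2024-01-07
start = date(2024, 1, 7)
observations = [(start + timedelta(weeks=i), 10 * i) for i in range(8)]

grid = monthly_grid(observations)
print(grid)
```

In a pandas workflow the same grid would typically come from a pivot of year against month, with the result passed to a plotting function such as `seaborn.heatmap`.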

<a name='visualizations-moderate-needs'></a>
#### Visualization for Moderate Interest Needs
* Visualize weekly interest over 3 years

In [None]:
# Show moderate interest needs data
moderate_relevance_df.head(3)

In [None]:
# Create a heatmap to show interest over time
create_heatmap(moderate_relevance_df, 'Moderate', category='Needs')

**Observations:**
  * The visualization reveals a consistent, though gradual, increase in interest in 'Sustainable food production' over the past three years. Interest has risen from low levels (0-20) in 2021 to moderate-high levels (30-50) in 2024.
  * There is a noticeable seasonal pattern, with the lowest interest typically observed in the summer months (July-September) and the highest interest during the spring (February-June).

**Conclusion:**

The steady growth of interest in 'Sustainable food production', coupled with its seasonal fluctuations, indicates a promising area for further analysis. The seasonal variation could be attributed to planting and harvesting seasons, although other factors could also play a role. This need should be included in further analysis, which should explore the factors driving the observed growth and seasonality in interest.





<a name='visualizations-high-needs'></a>
#### Visualization for High Interest Needs

In [None]:
# Show high interest needs data
high_relevance_df.head(3)

In [None]:
# Create a heatmap to show interest over time
create_heatmap(high_relevance_df, 'High', category='Needs')

**Observations:**
  * The visualization reveals rapid growth in interest in 'Educational AI' from January 2023 until June 2024. Interest remained low (0-20) from June 2021 until December 2022 and then grew rapidly from 20 to 100 throughout 2023 and into mid-2024.
  * The data shows no noticeable seasonal pattern, and the overall highest interest was observed in April 2024.
  * Over the last 2 months, interest levels decreased from 100 to 80, which suggests the potential for a drop in interest.

**Conclusion:**

The rapid and consistent growth in interest in 'Educational AI' makes this need a crucial area for further analysis. This need should be considered for inclusion in the MVP design, and additional analysis should explore the factors driving the observed rapid growth and the potential for a future decrease in interest. The growth is likely driven by innovation in AI technology, but it could be subject to other factors as well.





<a name='visualizations-critical-needs'></a>
#### Visualization for Critical Interest Needs

In [None]:
# Show critical interest needs data
critical_relevance_df_needs.head(3)

In [None]:
# Create a heatmap to show interest over time
create_heatmap(critical_relevance_df_needs, 'Critical', category='Needs')

**Observations:**
  * The visualization reveals rapid growth in interest in 'Climate resilience' from June 2021, increasing from initial levels of 0-20 to critically high levels of 80-100 by October 2022, peaking in April 2024 and maintaining these levels into June 2024.
  * The absence of seasonal patterns suggests consistent underlying drivers.
  * Currently, interest remains critically high, with no apparent signs of decline.

**Conclusion:**

The persistent and rapid growth in interest for 'Climate resilience' underscores its critical importance for MVP design and further analysis. This heightened interest is likely driven by increased awareness of climate change effects and a growing desire for action. Other contributing factors may include technological advancements, government policies, and increased media coverage. Continuous monitoring using real-time data and predictive modeling can help identify the key drivers of this trend and predict future changes.


<a name='contextual-exploration'></a>
## Contextual Exploration of Most Relevant Keywords
* Present most relevant keywords out of all categories (critical interest keywords)
* Define relevant categories and sub-categories of keywords
* Explore and interpret trending related searches

<a name='most-relevant-keywords'></a>
#### Summary Table: Most Relevant Keywords
* Present and compare the most relevant keywords for features, problems and needs

In [None]:
# Show critical relevance keywords for features
features_by_interest[features_by_interest['relevance']=='Critical']

In [None]:
# Show critical relevance keywords for problems
problems_by_relevance[problems_by_relevance['relevance']=='Critical']

In [None]:
# Show critical relevance keywords for needs
needs_by_relevance[needs_by_relevance['relevance']=='Critical']

In [None]:
# Combine all critical relevance keywords into one dataframe and add a column for the category
critical_interest_keywords = pd.concat([
    features_by_interest[features_by_interest['relevance'] == 'Critical'].assign(category='Feature'),
    problems_by_relevance[problems_by_relevance['relevance'] == 'Critical'].assign(category='Problem'),
    needs_by_relevance[needs_by_relevance['relevance'] == 'Critical'].assign(category='Need')
])

# Remove unnecessary columns and sort by weekly_average_interest
critical_interest_keywords = (
    critical_interest_keywords
    .drop(columns=['interest', 'relevance'])
    .sort_values(by='weekly_average_interest', ascending=False)
)

critical_interest_keywords

<a name='categories'></a>
#### Exploration of the Context of Keyword Searches
* Get Google Search suggestions to retrieve a list of related keywords.
* Analyze suggestions to explore user search intent.
* Get Google Search categories for main keyword and suggestions.
* Analyze categories to explore context of interest.

#### Evaluation of Search Suggestions

In [None]:
# Get a list of all critical relevance keywords
critical_interest_keywords_list = critical_interest_keywords.index.tolist()
critical_interest_keywords_list

In [None]:
# Initialize an empty dictionary to store suggestions
suggestions_dict = {}

In [None]:
# Get suggestions for all keywords in the list

# Iterate over each keyword in the list
for keyword in critical_interest_keywords_list:
    suggestions = pytrends.suggestions(keyword=keyword)
    suggestions_df = pd.DataFrame(suggestions).drop(columns=['mid'])
    suggestions_dict[keyword] = suggestions_df

# Return the dictionary of suggestions dataframes
suggestions_dict.keys()

**Search Suggestions for 'Solar powered'**

In [None]:
# Return the results for 'Solar powered'
suggestions_dict['Solar powered']

**Observations:**
* The capitalized 'Solar Powered' is a minor variation, likely used interchangeably by searchers. It confirms the term's validity and offers a slight keyword variation for later use.
* The presence of 'Solar energy' indicates a significant interest in understanding the benefits and potential of solar technology, beyond simply using solar-powered products.
* The variety of specific solar-powered products, including 'Solar-powered pump', 'Solar-powered fan' and 'Solar-powered Calculator', suggests a demand for diverse applications, including outdoor and computing appliances. This aligns with the concept of a solar-powered garden robot.
* Further analysis should identify potential customer groups interested in various solar-powered applications, including gardening robots, and assess the competitive landscape of solar-powered products.

**Conclusion:**

Overall, 'Solar powered' is a valuable feature to include in the prototype design, as there is demand for solar-powered appliances as well as interest in solar-powered technology.







In [None]:
# Adjust pandas display settings to show more content
pd.set_option('display.max_colwidth', None)

**Search Suggestions for 'Plant detection'**

In [None]:
# Return the suggestions for 'Plant detection'
suggestions_dict['Plant detection']

**Interpretation of Results: 'Plant detection'**
* The search suggestions for 'Plant detection' are all books, indicating a strong academic and research interest. This suggests that the feature is still in development, with few commercial products available.
* The book titles suggest that the target audience consists of researchers, scientists, and students in fields like botany, agriculture, or computer vision, rather than general consumers.
* The results suggest potential collaborators in the fields of machine learning and biology, as books from both fields appear in the results. Researchers in these areas could also be potential customers for the product, as they may have an interest in testing and understanding advanced plant detection technologies.
* These specific results suggest a scientific niche audience for this keyword.
* The search results highlight the field's interdisciplinary nature, encompassing machine learning, medical applications and genetics, and show a focus on plant disease detection (Balaji Aglave).

**Conclusion:**
* Users searching for 'Plant detection' seek research information rather than consumer products. Further analysis should explore why product searches are lacking and whether researchers would be interested in consumer products like a plant-detection gardening robot.

**Search Suggestions for 'Climate resilience'**

In [None]:
# Return suggestions for 'Climate resilience'
suggestions_dict['Climate resilience']

**Interpretation of Results: 'Climate resilience'**
* The search suggestions for 'Climate resilience' highlight significant interest in both personal and community strategies to address environmental challenges.
* 'Climate resilience' is identified as a key topic by Google, indicating a significant and growing interest in this area.
* All the suggested books focus on self-help and emotional resilience, reflecting people's interest in taking action to cope with societal and environmental challenges. This interest highlights the need for practical solutions that individuals and communities can use to enhance resilience.
* Saket Soni, one of the suggestions, is the founder of Resilience Force, an NGO that trains disaster recovery workers to enhance resilience. A gardening robot could support their work by improving food security and sustainability in post-disaster scenarios through climate-resilient food systems.


**Conclusion:**
* The results suggest a potential market for a gardening robot capable of building climate-resilient and adaptive food systems. Such a product aligns with a broad audience’s proactive mindset by providing a practical way to enhance food security and sustainability, empowering individuals and communities to take meaningful action toward a resilient future.

**Search Suggestions for 'Monoculture in agriculture'**

In [None]:
# Return suggestions for 'Monoculture in agriculture'
suggestions_dict['Monoculture in agriculture']

**Interpretation of Results: 'Monoculture in agriculture'**
* The suggested books and topics collectively highlight public awareness of the various problems associated with 'Monoculture in agriculture', such as its impact on diet, the environment, and disease spread.
* All of the suggested books emphasize the need for sustainable farming practices that promote biodiversity and ecological balance. The proposed gardening robot could address this problem by designing and creating polyculture systems that provide nutritious food and maintain yields sustainably.

**Conclusion:**

* The recognition of the problem of monoculture in agriculture resonates with both consumers seeking healthy food and farmers looking for more resilient food systems. The suggestions highlight the widely acknowledged negative effects of monoculture on the environment and health, underscoring the need for solutions that balance sustainable food systems with farmers' ability to maintain profitability. This indicates strong interest in a gardening robot that can create polyculture systems, delivering nutritious food and yields comparable to monoculture. This robot could be of interest for both consumers and farmers.

#### Evaluation of Categories of Interest

* Determine which categories Google would place keywords in based on suggestions and the keywords themselves.
* Search the categories dictionary recursively for occurrences of strings in the suggestions or the search term itself.
* Interpret results and define which categories are relevant for further analysis.
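
The recursive search named above could be written as a generator that descends to any nesting depth. The sketch below is an alternative for illustration, not the notebook's implementation: `walk_categories`, `find_matches`, the `toy` tree and its IDs are all hypothetical, and only the `name`/`id`/`children` keys are taken from the structure the notebook reads. It also compares whole words, whereas a substring check such as `word in name` would let a short word match inside longer ones.

```python
import re


def walk_categories(node, target_words, parents=()):
    """Yield a match for every category, at any depth, whose name
    shares at least one whole word with target_words."""
    name = node.get("name", "")
    name_words = set(re.findall(r"[a-z]+", name.lower()))
    if name and name_words & target_words:
        yield {
            "category_name": name,
            "category_id": node.get("id"),
            "parent_categories": list(parents),
        }
    child_parents = parents + (name,) if name else parents
    for child in node.get("children", []):
        yield from walk_categories(child, target_words, child_parents)


def find_matches(categories, keywords,
                 stopwords=frozenset({"and", "in", "the", "of", "for", "to", "a"})):
    """Collect matches below the root node for all non-stopword keyword words."""
    target_words = {w for kw in keywords for w in kw.lower().split()} - stopwords
    matches = []
    for top in categories.get("children", []):
        matches.extend(walk_categories(top, target_words))
    return matches


# Toy category tree with made-up IDs, for illustration only
toy = {"name": "All categories", "id": 0, "children": [
    {"name": "Science", "id": 1, "children": [
        {"name": "Ecology & Environment", "id": 2, "children": [
            {"name": "Climate Change & Global Warming", "id": 3}]}]},
    {"name": "Business & Industrial", "id": 4, "children": [
        {"name": "Energy & Utilities", "id": 5}]},
]}

matches = find_matches(toy, ["Climate resilience"])
print(matches)
```

Because the two matching rules differ, this variant may return fewer categories than a substring-based search on the same input.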

In [None]:
# Get a nested dictionary of all available Google Search categories
categories = pytrends.categories()
categories.keys()

In [None]:
def find_matched_categories(suggestion_keywords_list, categories=categories, stopwords=None):
    """
    Finds and returns categories from a hierarchical category structure
    that match any word from the given list of suggestion keywords,
    excluding common stopwords.

    This function iterates through a nested dictionary of categories,
    checking each category name (and subcategory names) to see if it
    contains any word from the provided list. It captures the matched
    categories along with their IDs and parent category names.

    Args:
        suggestion_keywords_list (list of str): A list of keywords to search for in category names.
        categories (dict): A nested dictionary structure representing categories and subcategories.
                           This should follow the format returned by pytrends.categories().
        stopwords (set of str): A set of words to ignore during the matching process.

    Returns:
        list of dict: A list of dictionaries, each containing information about a matched category,
                      including the category name, category ID, and a list of parent categories.
    """
    matched_categories = []

    # Define default stopwords if none are provided
    if stopwords is None:
        stopwords = {'and', 'in', 'the', 'of', 'for', 'to', 'a'}

    # Create a set of individual words from all keywords, excluding stopwords
    suggestion_words_set = set()
    for keyword in suggestion_keywords_list:
        words = keyword.split()  # Split keyword into words
        filtered_words = [word.lower() for word in words if word.lower() not in stopwords]
        suggestion_words_set.update(filtered_words)  # Add non-stopwords to the set

    # Iterate over each top-level category
    for top_category in categories.get('children', []):
        top_category_name = top_category.get('name', '').lower()
        top_category_id = top_category.get('id', '')

        # Check if any suggestion word is in the top-level category name
        if any(word in top_category_name for word in suggestion_words_set):
            matched_categories.append({
                'category_name': top_category.get('name', ''),
                'category_id': top_category_id,
                'parent_categories': []  # No parent categories here
            })

        # Iterate over each subcategory of the current top-level category
        for sub_category in top_category.get('children', []):
            sub_category_name = sub_category.get('name', '').lower()
            sub_category_id = sub_category.get('id', '')

            # Check if any suggestion word is in the subcategory name
            if any(word in sub_category_name for word in suggestion_words_set):
                matched_categories.append({
                    'category_name': sub_category.get('name', ''),
                    'category_id': sub_category_id,
                    'parent_categories': [top_category.get('name', '')]
                })

            # Further iterate over sub-subcategories if any
            for sub_sub_category in sub_category.get('children', []):
                sub_sub_category_name = sub_sub_category.get('name', '').lower()
                sub_sub_category_id = sub_sub_category.get('id', '')

                # Check if any suggestion word is in the sub-subcategory name
                if any(word in sub_sub_category_name for word in suggestion_words_set):
                    matched_categories.append({
                        'category_name': sub_sub_category.get('name', ''),
                        'category_id': sub_sub_category_id,
                        'parent_categories': [top_category.get('name', ''), sub_category.get('name', '')]
                    })

    return matched_categories

**Interest Categories for 'Solar powered'**

In [None]:
# Store suggestions for 'Solar powered' in a list
solar_powered_suggestions = suggestions_dict['Solar powered']['title'].tolist()
solar_powered_suggestions

In [None]:
# Keep only general terms for broader category research
solar_powered_suggestions = solar_powered_suggestions[:2]
solar_powered_suggestions

In [None]:
# Return matched categories for 'Solar powered'
categories_solar_powered = find_matched_categories(solar_powered_suggestions)
categories_solar_powered = pd.DataFrame(categories_solar_powered)
categories_solar_powered

In [None]:
# Store relevant categories (all are relevant here)
categories_solar_powered

**Observations:**
* The analysis identified three relevant categories for users interested in solar power: 'Energy & Utilities', 'Renewable & Alternative Energy' and 'Nuclear Energy'.
* The broad category 'Energy & Utilities' shows a general interest in energy topics among those interested in solar power.
* 'Nuclear Energy' suggests users are interested in comparing solar power to other energy sources, potentially influencing their decision-making when investing in solar-powered technology.
* 'Renewable & Alternative Energy' directly aligns with solar power as a renewable energy source, indicating a strong interest in this broader field.
* All categories are relevant for analyzing the broader interests of people interested in 'Solar powered'.

**Conclusion:**

The diverse areas of interest associated with solar power, spanning from the general 'Energy & Utilities' to the specific 'Renewable & Alternative Energy' and comparative 'Nuclear Energy,' indicate a diverse audience with wide-ranging motivations. This suggests potential customers for solar-powered products may have varying priorities, like general energy concerns, researching specific appliances or interest in renewable energy sources compared to alternatives.

**Interest Categories for 'Plant detection'**

In [None]:
# Return matches for 'Plant detection'
plant_detection_suggestions = suggestions_dict['Plant detection']['title'].tolist()
plant_detection_suggestions

In [None]:
# Remove the first keyword because it refers to another meaning of 'plant'
plant_detection_suggestions.pop(0)
plant_detection_suggestions

In [None]:
# Extract 'Machine Learning' as it is the only relevant part for plant detection
plant_detection_suggestions[0] = ' '.join(plant_detection_suggestions[0].split()[:2])
plant_detection_suggestions

In [None]:
# Return matching categories for 'Plant detection'
categories_plant_detection = find_matched_categories(plant_detection_suggestions)
categories_plant_detection = pd.DataFrame(categories_plant_detection)
categories_plant_detection

In [None]:
# Store relevant categories
categories_plant_detection = categories_plant_detection[
    categories_plant_detection['category_name'].isin(
        ['Biological Sciences', 'Genetics', 'Machine Learning & Artificial Intelligence']
    )
]
categories_plant_detection

**Interpretation of Results: 'Plant detection'**

* The analysis identified 8 categories for 'Plant detection', of which only 'Biological Sciences', 'Machine Learning & Artificial Intelligence' and 'Genetics' are relevant, as they directly align with the core research areas involved in plant detection.
* 'Biological Sciences' encompasses the study of plants, their structures, and their interactions with the environment, knowledge essential for creating accurate and effective plant detection algorithms.
* 'Machine Learning & Artificial Intelligence' is essential for developing plant detection algorithms, particularly the sub-field of computer vision.
* 'Genetics' is crucial for identifying and classifying plant species, as well as detecting genetic variations or diseases.
* 'Biological Sciences', 'Machine Learning & Artificial Intelligence' and 'Genetics' are relevant for analyzing the broader interests of people interested in 'Plant detection'.


**Interest Categories for 'Climate resilience'**

In [None]:
# Show suggestions for 'Climate resilience'
climate_resilience_suggestions = suggestions_dict['Climate resilience']['title'].tolist()
climate_resilience_suggestions

In [None]:
# Remove irrelevant book titles (novel/self-help titles) and author name
climate_resilience_suggestions = ['Climate resilience']
climate_resilience_suggestions

In [None]:
# Return matching categories for 'Climate resilience'
categories_climate_resilience = find_matched_categories(climate_resilience_suggestions)
categories_climate_resilience = pd.DataFrame(categories_climate_resilience)
categories_climate_resilience

In [None]:
# Store relevant categories
categories_climate_resilience = categories_climate_resilience[categories_climate_resilience['category_name'].isin(['Climate Change & Global Warming'])]
categories_climate_resilience

**Interpretation of Results: 'Climate resilience'**
* The analysis identified 3 categories for 'Climate resilience', of which only 'Climate Change & Global Warming' is relevant.
* This category is the most relevant as climate resilience is a strategy for coping with the impacts of climate change and global warming.
* 'Climate Change & Global Warming' is relevant for analyzing the broader interests of people interested in 'Climate resilience'.



**Interest Categories for 'Monoculture in agriculture'**

In [None]:
# Show suggestions for 'Monoculture in agriculture'
monoculture_in_agriculture_suggestions = suggestions_dict['Monoculture in agriculture']['title'].tolist()
monoculture_in_agriculture_suggestions

In [None]:
# Remove everything but 'Future of Food' from the first book title, title has no direct relevance
monoculture_in_agriculture_suggestions[0] = ' '.join(monoculture_in_agriculture_suggestions[0].split()[6:9])
monoculture_in_agriculture_suggestions

In [None]:
# Remove names of specific crops from the book titles, just one example of many crops
monoculture_in_agriculture_suggestions[2] = monoculture_in_agriculture_suggestions[2].replace('Delicata Squash', ' ')
monoculture_in_agriculture_suggestions

In [None]:
# Remove 'Purslane: Guide and Overview', same as other book title with a different crop
monoculture_in_agriculture_suggestions.pop(3)
monoculture_in_agriculture_suggestions

# Remove the last item, refers to a specific crop, not relevant for categorization
monoculture_in_agriculture_suggestions.pop(-1)
monoculture_in_agriculture_suggestions

In [None]:
# Return matching categories for 'Monoculture in agriculture'
categories_monoculture_in_agriculture = find_matched_categories(monoculture_in_agriculture_suggestions, stopwords={'on', 'and', 'the', 'of', 'make', 'big'})
categories_monoculture_in_agriculture = pd.DataFrame(categories_monoculture_in_agriculture)
categories_monoculture_in_agriculture

In [None]:
# Store relevant categories
categories_monoculture_in_agriculture = categories_monoculture_in_agriculture[categories_monoculture_in_agriculture['category_name'].isin(['Food & Drink', 'Food Production', 'Biological Sciences'])]
categories_monoculture_in_agriculture


**Interpretation of Results: 'Monoculture in agriculture'**
* The analysis identified 25 categories for 'Monoculture in agriculture', of which 'Food Production', 'Food & Drink' and 'Biological Sciences' have the most relevance.
* While multiple scientific disciplines relate to monoculture, 'Biological Sciences' was chosen for this analysis as it is the most relevant for prototype development, dealing with the core aspect of plant interactions with their environment and how to use that knowledge to find alternatives to monoculture in agriculture.
* 'Food Production' is directly relevant as monoculture is a dominant practice in industrial agriculture. People interested in alternatives are likely exploring different production methods.
* 'Food & Drink' captures interests in sustainable, organic, or local food, which often intersect with concerns about monoculture's impact on food systems.
* 'Food Production', 'Food & Drink' and 'Biological Sciences' are relevant for analyzing the broader interests of people interested in 'Monoculture in agriculture'.



<a name='presentation-of-results'></a>
## Presentation of Results & Insights
* Summarize the key findings and insights gathered from the analysis of keyword performance, interest trends, and the contextual exploration of user interests.
* Show best performing keywords based on interest over time.
* Show interest trends and patterns for most relevant keywords.
* Explore target audience interest through keyword context analysis.
* Visualize insights on keyword performance, trends and context.


<a name='keyword-performance'></a>
### Keyword Performance Results
* Show insights on keywords with no relevance.
* Show best performing keywords out of all categories (features, problems and needs).
* Interpret the results.

#### Insights on Keywords with No Interest
* The choice of keywords is too complex or too niche.
* The number of words in a keyword negatively affects its effectiveness.
* These observations hold for keywords in the *features*, *problems* and *needs* groups.

* *Recommended Next Steps*
  1. Reduce word count in keywords to no more than 3
  2. Use `pytrends.suggestions` on the full keyword list to refine keyword phrasing and expand reach to reflect desired context.
* Full analysis at:
[Brief Evaluation of Keywords with No Interest](#data-no-interest)
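
A minimal sketch of the first recommended step, assuming keywords are plain strings and that "word count" means whitespace-separated words. The helper `split_by_word_count` and the sample keywords are hypothetical and not part of this notebook; the sketch only flags keywords above the three-word limit and does not call `pytrends`.

```python
def split_by_word_count(keywords, max_words=3):
    """Split keywords into those within the word limit and those exceeding it."""
    within, too_long = [], []
    for keyword in keywords:
        # Count whitespace-separated words
        if len(keyword.split()) <= max_words:
            within.append(keyword)
        else:
            too_long.append(keyword)
    return within, too_long


# Hypothetical sample keywords, used only for illustration
sample_keywords = [
    'Solar powered',
    'Monoculture in agriculture',
    'Automated detection of plant diseases',
]
within_limit, needs_rewording = split_by_word_count(sample_keywords)
print('Within limit:', within_limit)
print('Needs rewording:', needs_rewording)
```

Keywords in the second list would be candidates for shortening before the analysis is repeated.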

In [None]:
# Show keywords in the features group that received no interest
zero_relevance_features.head(3)

In [None]:
# Show highest performing keywords in the features group
round(features_by_interest['weekly_average_interest'].head(3)).astype(int)

In [None]:
# Show keywords that received no interest in the needs group
zero_relevance_needs.head(3)

In [None]:
# Show highest interest keywords in the needs group
round(needs_by_relevance['weekly_average_interest'].head(3)).astype(int)

In [None]:
# Show keywords that received no interest in the problems group
zero_relevance_problems.head(3)

In [None]:
# Show highest interest keywords in the problems group
round(problems_by_relevance['weekly_average_interest'].head(3)).astype(int)

#### Best Performing Keywords
**Insights:**

* The majority of keywords in all groups (80% and above) received low interest.
* The keywords with absolute highest interest were found in the *features* group. It includes keywords of 'Moderate', 'High' and 'Critical' interest.
* The *problems* group has the lowest absolute interest: its highest interest value is almost 4 times lower than the highest interest value in the *features* group (a ratio of roughly 3.9). It includes only one 'Critical' relevance keyword, as all other keywords in the group have a relevance of less than 10% of this keyword.
* The *needs* group falls between the other two groups in absolute interest and includes keywords in all categories ('Moderate', 'High' and 'Critical').
* Moderate Interest Keywords (2 total)
  1. The need 'Sustainable food production'.
  2. The feature 'Food system health'.
* High Interest Keywords (4 total)
  1. The feature 'Educational value'
  2. The feature 'Plant identification'
  3. The feature 'Remote access and control'
  4. The need 'Educational AI'
* Critical Interest Keywords (4 total)
  1. The feature 'Solar powered'
  2. The feature 'Plant detection'
  3. The need 'Climate resilience'
  4. The problem 'Monoculture in agriculture'

**Recommended next steps:**
  1. Focus further analysis only on keywords in the 'Moderate', 'High' or 'Critical' interest categories.
  2. Prioritize keywords by these categories in further analysis steps.
  3. Revise the initial keyword list for the *problems* group and find keywords with the potential to yield higher absolute interest values.

**Links to Detailed Analysis:**

* Analysis for *features* group: [Definition of Most Relevant Features](#most-relevant-features)
* Analysis for *problems* group: [Definition of Most Relevant Problems](#most-relevant-problems)
* Analysis for *needs* group: [Definition of Most Relevant Needs](#most-relevant-needs)

*Note:*

Interest categories are determined relative to the highest interest in each group. Therefore, 'Critical' interest in the *problems* group is equivalent to 'Moderate' interest in the *features* group.
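
To illustrate this note, the sketch below assigns a category from a keyword's share of its own group's maximum. The function `relative_category`, the thresholds (25%, 50% and 75% of the group maximum) and the sample values are assumptions made for illustration; the thresholds used in this notebook may differ.

```python
def relative_category(value, group_max,
                      thresholds=((0.75, 'Critical'), (0.5, 'High'), (0.25, 'Moderate'))):
    """Categorize a value by its share of the group's maximum (hypothetical thresholds)."""
    share = value / group_max if group_max else 0.0
    for lower_bound, label in thresholds:
        if share >= lower_bound:
            return label
    return 'Low'


# The same absolute interest lands in different categories depending on the group maximum
same_value = 25
in_weak_group = relative_category(same_value, group_max=25)    # value is the group's maximum
in_strong_group = relative_category(same_value, group_max=80)  # value is about 31% of the maximum
print(in_weak_group, in_strong_group)
```

This is why categories should only be compared within a group, while comparisons across groups should use the absolute `weekly_average_interest` values.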

**Percentage of Keywords with 'Low' Interest**

In [None]:
# Calculate percentage of low interest needs
low_relevance_needs = round((needs_by_relevance['relevance'] == 'Low').mean() * 100)
print(f'Percentage of low relevance needs: {low_relevance_needs}%')
print("-" * 40)

# Calculate percentage of low interest features
low_relevance_features = round((features_by_interest['relevance'] == 'Low').mean() * 100)
print(f'Percentage of low relevance features: {low_relevance_features}%')
print("-" * 40)

# Calculate percentage of low interest problems
low_relevance_problems = round((problems_by_relevance['relevance'] == 'Low').mean() * 100)
print(f'Percentage of low relevance problems: {low_relevance_problems}%')


**'Moderate' Interest Keywords**

In [None]:
# Combine all 'Moderate' relevance keywords into one dataframe and add a column for the category
moderate_interest_keywords = pd.concat([
    features_by_interest[features_by_interest['relevance'] == 'Moderate'].assign(category='Feature'),
    problems_by_relevance[problems_by_relevance['relevance'] == 'Moderate'].assign(category='Problem'),
    needs_by_relevance[needs_by_relevance['relevance'] == 'Moderate'].assign(category='Need')
])

# Remove unnecessary columns and sort by weekly_average_interest
moderate_interest_keywords = (
    moderate_interest_keywords
    .drop(columns=['interest', 'relevance'])
    .sort_values(by='weekly_average_interest', ascending=False)
)

# Show the resulting dataframe
moderate_interest_keywords

**'High' Interest Keywords**

In [None]:
# Combine all 'High' relevance keywords into one dataframe and add a column for the category
high_interest_keywords = pd.concat([
    features_by_interest[features_by_interest['relevance'] == 'High'].assign(category='Feature'),
    problems_by_relevance[problems_by_relevance['relevance'] == 'High'].assign(category='Problem'),
    needs_by_relevance[needs_by_relevance['relevance'] == 'High'].assign(category='Need')
])

# Remove unnecessary columns and sort by weekly_average_interest
high_interest_keywords = (
    high_interest_keywords
    .drop(columns=['interest', 'relevance'])
    .sort_values(by='weekly_average_interest', ascending=False)
)

# Show the resulting dataframe
high_interest_keywords

**'Critical' Interest Keywords**


In [None]:
# Show the 'Critical' interest keywords dataframe
critical_interest_keywords

**Visualization: Interest per Relevance Category**
* Visualize interest per category for all groups to compare absolute interest.

In [None]:
# Reset index for moderate interest dataframe
moderate_interest_keywords_copy = moderate_interest_keywords.copy()
moderate_interest_keywords_reset = moderate_interest_keywords_copy.reset_index()

# Reset index for high interest dataframe
high_interest_keywords_copy = high_interest_keywords.copy()
high_interest_keywords_reset = high_interest_keywords_copy.reset_index()

# Reset index for critical interest dataframe
critical_interest_keywords_copy = critical_interest_keywords.copy()
critical_interest_keywords_reset = critical_interest_keywords_copy.reset_index()

In [None]:
# Access the Set2 palette
set2_colors = sns.color_palette("Set2")

# Define a custom color palette using Set2 colors
color_palette = {
    'Feature': set2_colors[0],  # green
    'Problem': set2_colors[1],  # orange
    'Need': set2_colors[2]  # blue
}

# Create subplots
fig, axes = plt.subplots(1, 3, figsize=(10, 4), dpi=200, sharey=True)  # 1 row, 3 columns

# Moderate Interest Keywords Plot
sns.barplot(
    data=moderate_interest_keywords_reset,
    x='index',
    y='weekly_average_interest',
    hue='category',
    palette=color_palette,
    ax=axes[0]  # Specify the first subplot
)
axes[0].set_xlabel('Keyword', fontsize=8)
axes[0].set_ylabel('Weekly Average Interest', fontsize=8)
axes[0].set_title('Moderate Interest Keywords', fontsize=9)
axes[0].tick_params(axis='x', labelsize=8, rotation=45)
axes[0].legend(title='Category', prop={'size': 8}, title_fontsize=9)
axes[0].set_ylim(0, 100)  # Set y-axis limits to ensure a tick at 100

# High Interest Keywords Plot
sns.barplot(
    data=high_interest_keywords_reset,
    x='index',
    y='weekly_average_interest',
    hue='category',
    palette=color_palette,
    ax=axes[1]  # Specify the second subplot
)
axes[1].set_xlabel('Keyword', fontsize=8)
axes[1].set_ylabel('Weekly Average Interest', fontsize=8)
axes[1].set_title('High Interest Keywords', fontsize=9)
axes[1].tick_params(axis='x', labelsize=8, rotation=45)
axes[1].legend(title='Category', prop={'size': 8}, title_fontsize=9)
axes[1].set_ylim(0, 100)  # Set y-axis limits to ensure a tick at 100

# Critical Interest Keywords Plot
sns.barplot(data=critical_interest_keywords_reset,
            x='index',
            y='weekly_average_interest',
            hue='category',
            palette=color_palette,
            ax=axes[2]  # Specify the third subplot
            )
axes[2].set_xlabel('Keyword', fontsize=8)
axes[2].set_ylabel('Weekly Average Interest', fontsize=8)
axes[2].set_title('Critical Interest Keywords', fontsize=9)
axes[2].tick_params(axis='x', labelsize=8, rotation=45)
axes[2].legend(title='Category', prop={'size': 8}, title_fontsize=9)
axes[2].set_ylim(0, 100)  # Set y-axis limits to ensure a tick at 100

# Remove individual legends
for ax in axes:
    ax.get_legend().remove()

# Collect all handles and labels for the legend
handles, labels = [], []
for ax in axes:
    for handle, label in zip(*ax.get_legend_handles_labels()):
        if label not in labels:
            handles.append(handle)
            labels.append(label)

# Create a single legend for the entire figure
fig.legend(handles, labels, loc='upper center', fontsize=8, title='Category', title_fontsize=9, ncol=3)

plt.tight_layout(rect=[0, 0, 1, 0.9])
plt.subplots_adjust(top=0.75, wspace=0.4)

plt.show()

**Observation:**
* For the 'Moderate' interest keywords, the absolute average weekly interest is between approximately 15 and 25. Absolute interest in the *needs* keyword is slightly higher than in the *features* keyword.
* For the 'High' interest keywords, the absolute average weekly interest is between approximately 25 and 60. All *features* keywords received significantly more interest than the *needs* keyword in this category.
* For the 'Critical' interest keywords, the absolute average weekly interest spans a wide range between 25 and 80. This is because the *problems* keyword with the highest interest received significantly less interest than the keywords in the other groups. It remains in this category because at least one keyword from every group must be included in further analysis.

<a name='interest-trends'></a>
### Interest Trend Analysis Results
* Show trends in interest in high performing keywords in the groups *features*, *problems* and *needs*.
* Compare trend analysis results of the three groups.

In [None]:
# Define a helper that plots one monthly heatmap per group of keyword trends
def create_heatmaps(dfs, interest_levels, categories):
    """
    This function takes multiple DataFrames, resamples the data to a monthly frequency,
    transposes them, and creates a heatmap for each in a subplot.

    Parameters:
    dfs (list of pd.DataFrame): A list of input DataFrames with a date index and keyword columns.
    interest_levels (list of str): A list of interest level descriptions for the heatmap titles.
    categories (list of str): A list of categories for the interest levels ('Features', 'Problems', 'Needs').

    Returns:
    None
    """
    # Check that all input lists have the same length
    if not (len(dfs) == len(interest_levels) == len(categories)):
        raise ValueError("All input lists must have the same length.")

    num_plots = len(dfs)  # Determine the number of plots needed

    # Create subplots with one row per DataFrame
    fig, axes = plt.subplots(num_plots, 1, figsize=(8, 1.5 * num_plots), dpi=150)  # Dynamic height based on number of plots

    # If there's only one plot, ensure axes is iterable
    if num_plots == 1:
        axes = [axes]

    for i in range(num_plots):
        df = dfs[i]
        interest_level = interest_levels[i]
        category = categories[i]

        # Ensure the date column is set as index
        if not isinstance(df.index, pd.DatetimeIndex):
            raise ValueError(f"The DataFrame at index {i} does not have a DatetimeIndex.")

        # Resample the data to a monthly frequency and take the mean
        df_monthly = df.resample('M').mean()

        # Transpose the DataFrame for heatmap
        df_transposed = df_monthly.T

        # Change the column labels to display month and year
        df_transposed.columns = df_transposed.columns.strftime('%b %Y')

        # Create the heatmap in the corresponding subplot
        sns.heatmap(df_transposed, cmap='viridis', cbar=True, vmin=1, vmax=100, ax=axes[i])
        axes[i].set_title(f'Interest in "{interest_level}" Interest {category} Over Time', fontsize=10)
        axes[i].set_xlabel('Date', fontsize=9)
        axes[i].set_ylabel('Keyword', labelpad=10, fontsize=9)
        axes[i].set_xticklabels(axes[i].get_xticklabels(), rotation=90, ha='right', fontsize=7)
        axes[i].set_yticklabels(axes[i].get_yticklabels(), fontsize=8)


    # Use subplots_adjust instead of tight_layout
    plt.subplots_adjust(left=0.05, right=0.9, top=1.9, bottom=0, hspace=0.55)  # Increase spacing between plots

    plt.show()
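
The monthly resampling step inside `create_heatmaps` can be illustrated without pandas. The sketch below groups hypothetical weekly observations by calendar month and averages them, which is the same grouping idea as `df.resample('M').mean()` applied to a single keyword. The helper `monthly_means` and the sample values are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date, timedelta


def monthly_means(weekly_points):
    """Average (date, value) observations per calendar month."""
    buckets = defaultdict(list)
    for day, value in weekly_points:
        buckets[(day.year, day.month)].append(value)
    # Return the months in chronological order
    return {month: sum(values) / len(values) for month, values in sorted(buckets.items())}


# Hypothetical weekly interest values, one observation every seven days
start = date(2024, 6, 2)
values = [10, 20, 30, 40, 50, 70]
weekly_points = [(start + timedelta(weeks=i), value) for i, value in enumerate(values)]

result = monthly_means(weekly_points)
print(result)
```

Averaging per month smooths week-to-week noise, which keeps the heatmap columns readable over a three-year period.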


#### Interest Trend Analysis Results for Features
* Visualize and summarize results found in trend analysis for features.

**Visualization of Trends in all Features Keywords**

In [None]:
create_heatmaps([moderate_interest_df, high_interest_df, critical_interest_df], ['Moderate', 'High', 'Critical'], ['Features', 'Features', 'Features'])

**Summary of Trend Analysis Insights for all *Features* Keywords**

**Insights:**
1. **Insights for 'Food system health':**
    * This is the only feature in the group with 'moderate' interest.
    * Interest is currently low but has been steadily increasing at a low rate of 1.74 units per year.
    * Interest has remained relatively stable over the last three years, with minimal fluctuation between peak and trough values.
    * A seasonal pattern is evident, with interest typically increasing throughout the year and dropping in the summer months.
    * As of June 23, 2024, interest is at a low point of 11 units.
2. **Insights for 'Educational value'**:
    * This proposed feature is classified as a 'high' interest feature.
    * While the interest level is generally high, it does exhibit moderate variability with a standard deviation of 16.
    * This feature consistently receives high interest, with a mean interest of 56 over the past three years and a lowest interest value of 22.
    * Interest has been increasing at a moderate yearly rate of 3.38, although there was a notable dip in 2023.
    * Both the highest and lowest points of interest show an overall upward trend, indicating growing interest in this feature despite the fluctuations.
3. **Insights for 'Remote access and control'**:
    * This proposed feature is classified as a 'high' interest feature.
    * Interest in 'Remote access and control' has been rapidly increasing at a rate of 19.36 over the last 3 years.
    * This feature demonstrates a stable baseline of interest, as it has not fallen below 33 units since 2022.
    * Currently, this feature has the highest level of interest (58) among all high-interest features and continues to grow.
4. **Insights for 'Plant identification'**:
    * This proposed feature is classified as a 'high' interest feature.
    * It demonstrates the most stable baseline interest among all features in this category, with a minimum interest level of 26 over the past three years.
    * While interest in 'Plant identification' has declined slightly at a rate of 4.6 units per year, the variation between peak and trough values remains low, suggesting overall stability.
    * Current interest in this feature (53) is comparable to other high-interest features, indicating relevance despite the decline.
5. **Insights for 'Plant detection':**
  * This proposed feature is classified as a 'critical' interest feature.
  * Over the past three years, interest in this feature has remained consistently high, with a lowest value of 23 and an average of 64.
  * Interest in 'Plant detection' has been increasing rapidly at a yearly rate of 12.38 units, with peak values growing by approximately 85% over the last three years.
  * Both peak and trough values have shown growth, indicating stable and consistent interest over time.
  * Currently, interest in this feature is high at 63 and continues to increase.
6. **Insights for 'Solar powered':**
  * This feature is classified as a 'critical' interest feature.
  * 'Solar powered' received the highest overall interest among all features, with an average of 75 units.
  * Despite slight fluctuations, interest in this feature has remained relatively stable, with the lowest value over the past three years being 51 and a standard deviation of 12 units.
  * Although experiencing a slight decline of 2.54 units per year, 'Solar powered' still maintains the highest overall interest among all features, demonstrating its continued relevance.
  * As of June 23, 2024, interest in this feature is at a very high value of 89 and has been increasing for the past two weeks.
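
Yearly rates of change like those quoted above can be estimated with a least-squares line fitted to the weekly values. The sketch below is one way to do this, assuming evenly spaced weekly observations and 52 weeks per year; the helper `yearly_trend` and the sample series are hypothetical, and the detailed analysis sections may compute the rates differently.

```python
def yearly_trend(values, periods_per_year=52):
    """Least-squares slope of evenly spaced observations, scaled to units per year."""
    n = len(values)
    if n < 2:
        raise ValueError('At least two observations are required.')
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance * periods_per_year


# Hypothetical series: three years of weekly values rising by 0.1 units per week
weekly_interest = [10 + 0.1 * week for week in range(156)]
trend = yearly_trend(weekly_interest)
print(f'Estimated change per year: {trend:.2f} units')
```

A positive result indicates rising interest and a negative result indicates declining interest. The slope alone says nothing about statistical significance, which needs a separate test.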

**Recommended next steps:**
  1. **Customer Segmentation Analysis:**
    * Focus on keywords with 'critical' or 'high' interest and prioritize keywords with rapidly rising interest and consistent interest. These are:
      * Educational value: Increasing and continuously high interest with moderate variability.
      * Remote access and control: Rapidly growing and stable baseline interest.
      * Plant identification: Stable, slightly declining interest.
      * Plant detection: Rapidly increasing, stable interest.
      * Solar powered: Highest overall interest and stable.
  2. **Prototype Development:**
    * It is too early to definitively decide which feature to include in the prototype design. However, some features that demonstrated particularly promising results in the trend analysis can be prioritized for consideration. These include:
      * Remote access and control: Rapidly growing and stable baseline interest.
      * Plant detection: Rapidly increasing interest and high current interest.
      * Solar powered: Highest overall interest and very high current interest.
  3. **Further Evaluation:**
    * Keywords that received moderate interest but show a pattern of rising interest should be evaluated further to decide whether they should be considered for further analysis and prototype development. Keywords that require further contextual evaluation are:
      * 'Food system health': Evaluate if a further increase in interest is likely.
      * 'Plant identification': Evaluate the reason for the current decline in interest.
      * Solar powered: Evaluate the reason for the overall slight decline in interest and the significance of the current increase in interest.

**Links to Detailed Analysis:**
* Trend analysis for moderate interest features: [Analysis Moderate Interest Features](#trends-moderate-features)
* Visualizations for moderate interest features: [Visualization Moderate Interest Features](#features-moderate-visualizations)
* Trend analysis for high interest features: [Analysis High Interest Features](#trends-high-features)
* Visualizations for high interest features: [Visualization High Interest Features](#features-high-visualizations)
* Trend analysis for critical interest features: [Analysis Critical Interest Features](#trends-critical-features)
* Visualizations for critical interest features: [Visualization Critical Interest Features](#features-critical-visualizations)

*Note:*

Interest is measured on a scale of 0 to 100, reflecting the relative popularity of a search term compared to all other Google searches by location over a given week. A value of 100 indicates peak popularity for the selected timeframe (here data has been collected over 3 years) and location (here worldwide). For more detailed information, refer to the [Basics of Google Trends](https://newsinitiative.withgoogle.com/resources/trainings/google-trends/basics-of-google-trends/#:~:text=Indexing%3A%20Google%20Trends%20data%20is,the%20time%20and%20location%20selected.)
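
The indexing described in this note can be sketched as follows: each period's searches for a term are divided by all searches in that period, and the resulting shares are scaled so that the largest share equals 100. The function `to_trends_index` and the search counts are hypothetical; Google does not publish raw counts, and the real index also involves sampling.

```python
def to_trends_index(term_searches, total_searches):
    """Scale per-period search shares so that the peak share maps to 100."""
    shares = [term / total for term, total in zip(term_searches, total_searches)]
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]
    return [round(100 * share / peak) for share in shares]


# Hypothetical counts for three weeks: searches for the term and all searches
term_searches = [50, 100, 150]
total_searches = [1000, 1000, 3000]
index = to_trends_index(term_searches, total_searches)
print(index)
```

In this example the third week has the most searches for the term in absolute numbers, yet its index value is below the peak because overall search volume was higher that week. This is why interest values describe relative popularity, not search volume.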


#### Interest Trend Analysis Results for Problems
* Visualize and summarize results found in trend analysis for problems.

**Visualization of Trends in all *Problems* Keywords**
* This visualization focuses solely on the trend data for 'Monoculture in agriculture', as it was identified as the only relevant keyword within the *problems* category due to very low overall interest in other suggested keywords.

In [None]:
create_heatmaps([critical_relevance_df], ['Critical'], ['Problems',])

**Insights:**
1. **Insights for 'Monoculture in agriculture':**
  * This keyword is included in the analysis due to its highest ranking within the *problems* category, despite demonstrating low overall interest and no significant trends.
  * Interest in 'Monoculture in agriculture' is typically zero to low, with occasional spikes indicated by a high standard deviation of 28 and a full range of values (0-100).
  * In at least 75% of observations, there was no interest in this proposed problem.
  * Despite the overall low interest, this keyword has achieved the maximum possible interest value of 100 at least once in the last 3 years, suggesting the potential for significant but short-lived spikes likely driven by external events.
  * Linear regression analysis reveals no statistically significant trend, supporting the hypothesis of random spikes rather than a consistent pattern.
  * As of June 23, 2024, interest was increasing, likely indicating another temporary spike rather than sustained growth.
  * All interest spikes have occurred early in the year, specifically in February, suggesting a potential seasonal pattern.
  * This keyword requires further evaluation before it can be considered for further analysis.
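
The kind of summary used in these insights (share of observations with no interest, and the months in which spikes occur) can be sketched as follows. The helper `spike_summary`, the spike threshold of 50 and the sample observations are assumptions for illustration and do not reproduce the notebook's data.

```python
from collections import Counter
from datetime import date


def spike_summary(points, spike_threshold=50):
    """Return the share of zero observations and the number of spikes per calendar month."""
    values = [value for _, value in points]
    zero_share = sum(1 for value in values if value == 0) / len(values)
    spike_months = Counter(day.month for day, value in points if value >= spike_threshold)
    return zero_share, dict(spike_months)


# Hypothetical observations: mostly zero, with isolated spikes
points = [
    (date(2022, 1, 2), 0),
    (date(2022, 2, 6), 100),
    (date(2022, 6, 5), 0),
    (date(2022, 10, 2), 0),
    (date(2023, 2, 5), 60),
    (date(2023, 5, 7), 0),
    (date(2023, 9, 3), 0),
    (date(2023, 12, 3), 0),
]
zero_share, spike_months = spike_summary(points)
print(f'Share of zero observations: {zero_share:.0%}')
print(f'Spikes per month number: {spike_months}')
```

If spikes cluster in the same month across several years, a seasonal driver is more plausible than a one-off external event.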

**Recommended next steps:**
  1. **Further Evaluation:**
    * As a first step, simplify the keyword 'Monoculture in agriculture' to 'Monoculture' to potentially increase the number of search results and gain a broader perspective on interest levels.
    * Compile a new list of problems, ensuring each proposed problem keyword consists of no more than three words.
    * Repeat the analysis conducted here with the new list of problems.
    * If the new list yields more relevant results, use those for further analysis.
    * If the new list does not yield more relevant results, consider using 'Monoculture' for further analysis, but only after confirming the existence of either a significant upward trend in interest or sustained high interest.

**Links to Detailed Analysis:**
* Trend analysis for critical interest problems: [Analysis Critical Interest Problems](#trends-critical-problems)
* Visualizations for critical interest problems: [Visualization Critical Interest Problems](#problems-critical-visualizations)

*Notes:*

* Interest is measured on a scale of 0 to 100, reflecting the relative popularity of a search term compared to all other Google searches by location over a given week. A value of 100 indicates peak popularity for the selected timeframe (here data has been collected over 3 years) and location (here worldwide). For more detailed information, refer to the [Basics of Google Trends](https://newsinitiative.withgoogle.com/resources/trainings/google-trends/basics-of-google-trends/#:~:text=Indexing%3A%20Google%20Trends%20data%20is,the%20time%20and%20location%20selected.)
* While the 'critical' keyword evaluated here has relatively low relevance to potential customers, it will still be included in further analysis to ensure comprehensive evaluation of all keyword groups. However, identifying alternative keywords with higher interest levels within this category would significantly enhance the value of subsequent analysis.

#### Interest Trend Analysis Results for Needs
* Visualize and summarize results found in trend analysis for needs.

**Visualization of Trends in all *Needs* Keywords**

In [None]:
create_heatmaps([moderate_relevance_df, high_relevance_df, critical_relevance_df_needs], ['Moderate', 'High', 'Critical'], ['Needs', 'Needs', 'Needs'])

**Insights:**
1. **Insights for 'Sustainable food production':**
  * This need is classified as the only 'moderate' interest need in this group, with an average interest of 25 and a standard deviation of 10.
  * Interest in 'Sustainable food production' has steadily increased at a rate of 5.85 units per year over the last 3 years.
  * The average peak interest has more than doubled over the last 3 years.
  * The lowest interest value (0) was observed in 2021; trough values have consistently increased since then and are now close to peak values.
  * Currently, as of June 23, 2024, interest is at a moderate level of 25 and continues to grow.
  * A seasonal pattern can be observed, with interest typically dropping in the summer months (July-September).
2. **Insights for 'Educational AI':**
  * This need is classified as the only 'high' relevance need in this category.
  * Interest in 'Educational AI' has grown dramatically, increasing by 33.84 units over the last 3 years and continuing to grow.
  * The highest peaks in interest have increased by a factor of 5 over the last 3 years.
  * The observed rapid growth began in 2023, culminating in peak interest (100) in April 2024.
  * Despite a slight decrease from 100 to 78 between April and June, interest remains very high at 78 as of June 23, 2024.
3. **Insights for 'Climate resilience':**
  * This need is classified as the only need of 'critical' relevance in this group of keywords.
  * Interest in 'Climate resilience' has grown consistently and rapidly at 18.02 units per year.
  * Average yearly peak interest has more than doubled in the last three years (increase by 147%).
  * The lowest observed interest (17 units) remains relatively high, and interest has increased throughout the entire measured period from June 2021 until June 2024.
  * Both peak and baseline interest levels have increased, indicating sustained growth in interest.
  * Currently, as of June 23, 2024, interest remains very high at 88 units and is still increasing.
  * The absence of seasonal patterns suggests that overall increased awareness and concern are the primary drivers of interest, rather than external events.
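
Statements about peak growth, such as a peak that "more than doubled", can be checked by comparing yearly maxima. The sketch below compares the first and last year's peak; the helper `peak_growth_percent` and the sample values are hypothetical, and the notebook's own figures may be based on a different definition of yearly peak.

```python
from datetime import date


def peak_growth_percent(points):
    """Percent change between the peak of the first year and the peak of the last year."""
    peaks = {}
    for day, value in points:
        peaks[day.year] = max(value, peaks.get(day.year, value))
    first_peak = peaks[min(peaks)]
    last_peak = peaks[max(peaks)]
    if first_peak == 0:
        raise ValueError('Growth is undefined when the first peak is zero.')
    return (last_peak - first_peak) / first_peak * 100


# Hypothetical observations spanning four calendar years
points = [
    (date(2021, 7, 4), 30),
    (date(2021, 11, 7), 40),
    (date(2022, 5, 1), 55),
    (date(2023, 3, 5), 80),
    (date(2024, 6, 23), 100),
]
growth = peak_growth_percent(points)
print(f'Peak growth: {growth:.0f}%')
```

An increase of more than 100% means the peak has more than doubled.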

**Recommended next steps:**
  1. **Customer Segmentation Analysis:**
    * Needs to consider for customer segmentation analysis should show either sustained notable interest or sustained growth in interest. Needs to include in further analysis are:
      * Sustainable food production: Consistently growing interest and moderate overall interest.
      * Educational AI: Rapid growth and current very high interest.
      * Climate resilience: Rapid growth, mature interest and currently very high interest.
  2. **Prototype Development:**
    * It is too early to definitively decide which needs to include in the prototype design. However, some needs that demonstrated particularly promising results in the trend analysis can be prioritized for consideration. These include:
      * Educational AI: Rapidly growing interest and very high current interest.
      * Climate resilience: Rapidly growing over the last 3 years, and very high current interest.
  3. **Further Evaluation:**
    * Educational AI: Evaluate the cause of rapid growth in interest in 'Educational AI', use this to predict if rapid growth is likely to continue.
    * Climate resilience: Analyze the factors driving its rapid growth and predict future interest trends.

**Links to Detailed Analysis:**
* Trend analysis for moderate interest needs: [Analysis Moderate Interest Needs](#trends-moderate-needs)
* Visualizations for moderate interest needs: [Visualization Moderate Interest Needs](#visualizations-moderate-needs)
* Trend analysis for high interest needs: [Analysis High Interest Needs](#trends-high-needs)
* Visualizations for high interest needs: [Visualization High Interest Needs](#visualizations-high-needs)
* Trend analysis for critical interest needs: [Analysis Critical Interest Needs](#trends-critical-needs)
* Visualizations for critical interest needs: [Visualization Critical Interest Needs](#visualizations-critical-needs)

*Note:*

Interest is measured on a scale of 0 to 100, reflecting the relative popularity of a search term compared to all other Google searches by location over a given week. A value of 100 indicates peak popularity for the selected timeframe (here data has been collected over 3 years) and location (here worldwide). For more detailed information, refer to the [Basics of Google Trends](https://newsinitiative.withgoogle.com/resources/trainings/google-trends/basics-of-google-trends/#:~:text=Indexing%3A%20Google%20Trends%20data%20is,the%20time%20and%20location%20selected.)


<a name='contextual-exoloration-results'></a>
### Contextual Exploration Results
* Briefly evaluate how search suggestions for keywords relate to interest categories found.
* Present the broader interest categories that the most relevant keywords fit into.
* Visualize interest categories and their intersections, to determine overall most relevant interest category.

**Interactive Visualization of Keyword Relationships and Broader Categories**
* Use a chord diagram to visualize relationships between keywords and broader interest categories.
* Identify categories with many connections to determine key categories for customer segmentation analysis.
* Identify clusters of related keywords that are associated with the same categories.
* Visualize the areas of interest the prototype should address.

**Data Preparation**
* Create a representation of data that shows relationships between keywords and categories.

In [None]:
# Annotate all dataframes with the related keyword and combine all into one dataframe
keywords = ['Solar Powered', 'Plant Detection', 'Climate Resilience', 'Monoculture in Agriculture']
dfs = [categories_solar_powered, categories_plant_detection, categories_climate_resilience, categories_monoculture_in_agriculture]

# Add a new column to each of the dataframes indicating the associated keyword
for i, (keyword, df) in enumerate(zip(keywords, dfs)):
    df_copy = df.copy()  # Create a copy of the original dataframe
    df_copy['keyword'] = keyword
    dfs[i] = df_copy  # Update the dfs list with the modified DataFrame

# Combine all dataframes into one
combined_categories_df = pd.concat(dfs, ignore_index=True)
combined_categories_df.head(3)


In [None]:
# Create a dictionary to store node types
node_types = {}

# Extract all unique categories from the dataframes and mark them as 'Category'
for category in combined_categories_df['category_name'].unique():
    node_types[category] = 'Category'

# Add parent categories to the node types dictionary and mark them as 'Category'
for parent_categories in combined_categories_df['parent_categories']:
    for parent in parent_categories:
        node_types[parent] = 'Category'

# Add the keywords to the node types dictionary and mark them as 'Keyword'
for keyword in keywords:
    node_types[keyword] = 'Keyword'
node_types

In [None]:
# Convert the node types dictionary to a DataFrame for easier processing
node_types = pd.DataFrame(list(node_types.items()), columns=['Node', 'Type'])
node_types.head(3)

In [None]:
# Create a list of node names for indexing
node_names = node_types['Node'].tolist()
node_names[:3]

In [None]:
# Import numpy
import numpy as np

In [None]:
# Place all nodes in a square matrix to model relationships, initially all values are set to 0
matrix_size = len(node_names)
relation_matrix = pd.DataFrame(np.zeros((matrix_size, matrix_size)), index=node_names, columns=node_names)

# Fill the matrix with relationships
for index, row in combined_categories_df.iterrows():
    category = row['category_name']
    keyword = row['keyword']

    # Ensure both category and keyword are present in nodes
    if category in node_names and keyword in node_names:
        relation_matrix.loc[category, keyword] = 1
        relation_matrix.loc[keyword, category] = 1

    # Link category to each parent category
    for parent in row['parent_categories']:
        if category in node_names and parent in node_names:
            relation_matrix.loc[category, parent] = 1
            relation_matrix.loc[parent, category] = 1
relation_matrix.head(3)
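
Because every link is written in both directions, the resulting matrix should be symmetric with an empty diagonal. A quick sanity check along those lines might look like the sketch below; the node names and pairs are illustrative assumptions, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Illustrative nodes and relationships (not the notebook's data)
nodes = ['Solar Powered', 'Energy & Utilities', 'Business & Industrial']
pairs = [
    ('Energy & Utilities', 'Solar Powered'),
    ('Energy & Utilities', 'Business & Industrial'),
]

toy_matrix = pd.DataFrame(np.zeros((len(nodes), len(nodes))), index=nodes, columns=nodes)
for a, b in pairs:
    toy_matrix.loc[a, b] = 1
    toy_matrix.loc[b, a] = 1

# The matrix should equal its transpose, and no node should link to itself
is_symmetric = bool((toy_matrix.values == toy_matrix.values.T).all())
has_self_links = bool(np.diag(toy_matrix.values).any())

# Row sums give the number of connections per node
connections = toy_matrix.sum(axis=1)
print(is_symmetric, has_self_links)
print(connections)
```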

In [None]:
# Import visualization libraries
import holoviews as hv
from holoviews import opts, dim

In [None]:
# Initialize empty list to store relationships in
edges = []

In [None]:
# Store each relationship as a (source, target, link, source_type, target_type) tuple
for index, row in relation_matrix.iterrows():
    for column, value in row.items():
        if value == 1:
            source_type = node_types.loc[node_types['Node'] == index, 'Type'].values[0]
            target_type = node_types.loc[node_types['Node'] == column, 'Type'].values[0]
            edges.append((index, column, value, source_type, target_type))  # link is 1.0 for every related pair

# Convert to dataframe
edges_df = pd.DataFrame(edges, columns=['Source', 'Target', 'Link', 'SourceType', 'TargetType'])
edges_df.tail(5)
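
Since the relation matrix is symmetric, each relationship appears twice in the edge list, once per direction. If one row per relationship is preferred, the edges can be taken from the upper triangle only. This is a sketch on a toy matrix with illustrative node names, offered as an optional alternative rather than a required change.

```python
import numpy as np
import pandas as pd

# Toy symmetric relation matrix (illustrative node names)
nodes = ['Solar Powered', 'Energy & Utilities', 'Business & Industrial']
toy_matrix = pd.DataFrame(
    [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]],
    index=nodes, columns=nodes,
)

# Keep only cells above the diagonal so each pair is listed once
upper_mask = np.triu(np.ones(toy_matrix.shape, dtype=bool), k=1)
stacked = toy_matrix.where(upper_mask).stack()

# Keep related pairs only and flatten into Source/Target/Link columns
unique_edges = stacked[stacked == 1].reset_index()
unique_edges.columns = ['Source', 'Target', 'Link']
print(unique_edges)
```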

**Visualization**
* Click on one of the nodes to see the categories and/or keywords it is related to.

In [None]:
# Enable rendering for HoloViews plots
%env HV_DOC_HTML=true

In [None]:
# Create HoloViews Dataset for edges
hv_data = hv.Dataset(edges_df, ['Source', 'Target'], 'Link')

In [None]:
# Create a HoloViews Dataset for nodes
hv_nodes = hv.Dataset(node_types, 'Node', 'Type')

In [None]:
# Enable the bokeh extension
hv.extension('bokeh')

# Define a color map for node types
color_map = {'Keyword': '#d9b3ff', 'Category': '#a0d9d9'}

# Create the chord diagram
chord = hv.Chord((hv_data, hv_nodes)).opts(
    opts.Chord(
        labels='Node',
        node_color=dim('Type').categorize(color_map),  # Color nodes based on their type
        edge_color='Source',  # Color edges based on their source
        edge_cmap='Sunset',
        edge_alpha=0.7,  # Set edge transparency
        node_alpha=1.0,  # Set node transparency
        width=800,  # Width of the plot
        height=800,  # Height of the plot
        title='Keyword and Categories Relationships',
        fontsize={'title': 16},  # Set title font size
        node_size=11,
    )
)


# Display the chord diagram
chord

**Insights:**
1. **Insights for 'Solar powered':**
  * Google search suggestions for 'Solar powered' span both specific products and broader energy concepts, indicating a diverse audience with varied interests.
  * The most relevant categories for this proposed feature ('Energy & Utilities', 'Renewable & Alternative Energy', and 'Nuclear Energy') suggest motivations ranging from product research to broader energy concerns.
  * This audience presents an opportunity to pair a solar-powered prototype with educational content that addresses both practical needs (product operation) and informational needs (solar energy benefits, comparisons with other energy sources, and solar power conversion).

**Recommended next steps:**
1. **Customer Segmentation Analysis:**
    * Keywords considered for customer analysis should have a broad audience and be searched in a context relevant to the prototype. Keywords that meet these requirements:
      * Solar powered: Potential customers show interest in both renewable energy solutions and specific solar-powered products.
2. **Prototype Development:**
    * It is too early to decide definitively what to include in the prototype design process. However, keywords that showed particularly promising results in the contextual analysis can be prioritized for consideration:
      * Solar powered: High and broad demand for, and interest in, solar-powered products as well as educational resources.
3. **Further Evaluation:**

**Links to Detailed Analysis:**
* Exploration of search suggestions and categorization: [Exploration of Interest Context](#categories)

*Notes:*

* Only the keywords classified as 'critical' in the *features*, *problems*, and *needs* analyses were considered for this exploration.
* Suggestions were retrieved by querying the Google Trends API and correspond to the terms that would appear when typing into the search bar of the Google Search interface (without personalization). They are predictions that Google makes based on frequently searched terms, which provides insight into popular user queries and interests related to a search term.
* Categories listed here are the same categories that Google Search uses to categorize search queries. Google uses a nested dictionary to classify search terms into categories.
* This is only an exploration of the context of searches. Further contextual analysis of all relevant keywords is required to gain more definitive insights about the target audience.
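
To illustrate how parent categories can be read from a nested category dictionary, here is a minimal sketch. It assumes each node has a `name`, an `id`, and an optional `children` list; that shape, the helper `find_parent_categories`, and the sample tree with its IDs are illustrative assumptions rather than the actual Google category data.

```python
def find_parent_categories(node, target, path=()):
    """Return the list of ancestor names of `target`, or None if it is absent."""
    if node['name'] == target:
        return list(path)
    for child in node.get('children', []):
        found = find_parent_categories(child, target, path + (node['name'],))
        if found is not None:
            return found
    return None


# Hypothetical category tree (names and IDs are placeholders)
toy_tree = {
    'name': 'All categories', 'id': 0,
    'children': [
        {'name': 'Business & Industrial', 'id': 1, 'children': [
            {'name': 'Energy & Utilities', 'id': 2, 'children': [
                {'name': 'Renewable & Alternative Energy', 'id': 3},
            ]},
        ]},
        {'name': 'Home & Garden', 'id': 4},
    ],
}

parents = find_parent_categories(toy_tree, 'Renewable & Alternative Energy')
print(parents)
```

A depth-first search like this returns the first match it finds, which is sufficient when category names are unique within the tree.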