<a href="https://colab.research.google.com/github/eleanorjolliffe/Capstone-2025/blob/main/Capstone_Analysis_Part_2_Climate_misinformation_detection_using_Semantic_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installing the necessary packages

In [None]:
!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util

import torch
import pandas as pd
import numpy as np
import textwrap
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import math


## Testing 4 SBERT models to evaluate their ability to detect climate misinformation

### Climate misinformation taxonomy and test statements

In [None]:
misinformation_examples = pd.read_csv('/content/Climate misinformation examples.csv')

In [None]:
#my test statements - 5 are climate misinfo, 3 are climate related and 2 are random
statements = [
    "I don't believe climate change is real - it is no warmer than it has been before and the ice isn't melting. It's honestly cold.",
    "NASA admits climate change occurs because of changes in Earth's solar orbit, not because of SUVS and fossil fuels",
    "Climate change isn’t actually a danger to our health.",
    "Electric cars are more environmentally devastating than petrol cars - they are charged on the nuclear power grid, they are encouraging lithium mining and their batteries degenerate.",
    "Leading meteorologist denounces wildfires as a fake climate con. They call for the arrest and imprisonment of Greta Thunberg and Sadiq Khan.",
    "This is absurd moral relativism. Did you not see Rhodes this year? Are you really a climate change denier in this day and age?",
    "Realistically, it's about bringing down the harmful particulates emitted by older diesel engines.",
    "But but... I work hard to afford a Range Rover so I can drive my kids to school.",
    "Poorer Londoners are significantly less likely to own a car and they're also much more likely to be worst affected by air pollution.",
    "Maybe in 70 years, they’ll admit the Covid jab was a dangerous Experiment"
]

### Function for semantic similarity testing & bar chart plotting

The function below takes three inputs: the sentence transformer model, the climate misinformation examples and the test statements. It first uses the model to encode every climate misinformation example as a dense vector, then does the same for every test statement. Next, it computes the cosine similarity between each test statement and each climate misinformation example. Finally, it iterates through the results to find the highest score for each test statement and the example it matched most closely, collecting these in a list that is returned as a DataFrame.

In [None]:
def misinformation_testing(model, misinformation_examples_list, test_statements):
    misinformation_embeddings = model.encode(misinformation_examples_list, convert_to_tensor=True)
    test_embeddings = model.encode(test_statements, convert_to_tensor=True)

    similarity_matrix = util.cos_sim(test_embeddings, misinformation_embeddings)

    results = []
    for i, sims in enumerate(similarity_matrix):
        best_idx = sims.argmax().item()
        best_score = sims[best_idx].item()
        best_example = misinformation_examples_list[best_idx]
        results.append({
            "Test Statement": test_statements[i],
            "Most Similar Example": best_example,
            "Similarity Score": best_score
        })

    return pd.DataFrame(results)

misinformation_examples_list = misinformation_examples['Climate misinformation example'].tolist()
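The matching step inside `misinformation_testing` can be illustrated without downloading a model. The sketch below is a library-free approximation of what `util.cos_sim` followed by `argmax` does for a single test statement. The three-dimensional vectors and the helper names `cosine_similarity` and `best_match` are hypothetical and exist only for illustration; real SBERT embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_match(test_vector, example_vectors):
    """Return (index, score) of the example most similar to test_vector."""
    scores = [cosine_similarity(test_vector, ex) for ex in example_vectors]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return best_idx, scores[best_idx]

# Hypothetical toy "embeddings" standing in for encoded misinformation examples.
toy_examples = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
# Hypothetical toy "embedding" standing in for one encoded test statement.
toy_test = [2.0, 2.0, 0.0]

idx, score = best_match(toy_test, toy_examples)
print(idx, round(score, 4))
```

Because cosine similarity ignores vector length, the test vector matches the third example perfectly even though it is twice as long. This is why the score reflects direction in embedding space rather than magnitude.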

Using plotly, the function below visualises the semantic similarity scores calculated above. It first copies the DataFrame and wraps each test statement over multiple lines so the axis labels stay readable. The bars are horizontal, their colour reflects the strength of the similarity score, and the hover data shows the most similar example.


In [None]:
def plot_misinformation_similarity_plotly(df_results, title="Climate Misinformation Similarity", wrap_width=40):
    df = df_results.copy()
    df["Wrapped Statement"] = df["Test Statement"].apply(lambda x: "<br>".join(textwrap.wrap(x, width=wrap_width)))

    fig = px.bar(
        df,
        x="Similarity Score",
        y="Wrapped Statement",
        orientation="h",
        title=title,
        color="Similarity Score",
        color_continuous_scale=px.colors.sequential.Blues,
        hover_data  =["Most Similar Example"],
        labels = {"Similarity Score": "Semantic Similarity Score"}
    )

    fig.update_layout(
        xaxis_title="Semantic Similarity Score",
        yaxis_title="Test Statement",
        yaxis=dict(autorange="reversed"),
        title_x=0.5,
        font=dict(size=6),
    )

    fig.show()
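The label-wrapping step in the plotting function can be tried on its own. This is a minimal sketch using only the standard library; the helper name `wrap_for_plotly` is hypothetical, and the width of 20 is chosen purely to make the wrapping visible on a short sentence.

```python
import textwrap

def wrap_for_plotly(text, width=40):
    """Break text into lines of at most `width` characters, joined by the
    HTML line break that plotly renders inside axis labels."""
    lines = textwrap.wrap(text, width=width)
    return "<br>".join(lines)

original = "Climate change isn't actually a danger to our health."
label = wrap_for_plotly(original, width=20)
print(label)
```

`textwrap.wrap` only breaks at whitespace, so words are never split across lines.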


### all-MiniLM-L6-v2

In [None]:
#loading the all mini lm sentence transformer
model_all_Mini = SentenceTransformer('all-MiniLM-L6-v2')


In [None]:
#applying the misinformation testing function to calculate semantic similarity between climate misinfo examples and test statements
df_MiniLM_results = misinformation_testing(model_all_Mini, misinformation_examples_list, statements)

In [None]:
plot_misinformation_similarity_plotly(df_MiniLM_results, title ='Mini LM Climate misinformation test')

### paraphrase-MiniLM-L12-v2

In [None]:
#loading the paraphrase mini lm sentence transformer
model_para = SentenceTransformer('paraphrase-MiniLM-L12-v2')


In [None]:
#applying the misinformation testing function to calculate semantic similarity between climate misinfo examples and test statements
df_para_MiniLM_results = misinformation_testing(model_para, misinformation_examples_list, statements)

In [None]:
plot_misinformation_similarity_plotly(df_para_MiniLM_results, title ='Paraphrase Mini LM Climate Misinformation test')

### paraphrase-mpnet-base-v2

In [None]:
#loading the paraphrase mpnet sentence transformer
model_mpnet = SentenceTransformer('paraphrase-mpnet-base-v2')


In [None]:
#applying the misinformation testing function to calculate semantic similarity between climate misinfo examples and test statements
df_para_mpnet_results = misinformation_testing(model_mpnet, misinformation_examples_list, statements)

In [None]:
plot_misinformation_similarity_plotly(df_para_mpnet_results, title="Paraphrase mpnet base Climate Misinformation test")

### all-mpnet-base-v2

In [None]:
#loading the all mpnet sentence transformer
model_all_mpnet = SentenceTransformer('all-mpnet-base-v2')


In [None]:
#applying the misinformation testing function to calculate semantic similarity between climate misinfo examples and test statements
df_all_mpnet_results = misinformation_testing(model_all_mpnet, misinformation_examples_list, statements)

In [None]:
plot_misinformation_similarity_plotly(df_all_mpnet_results, title="All mpnet base Climate Misinformation test")

## Applying the chosen model, all-MiniLM-L6-v2, to random batch samples of Reddit & Telegram data

The following section applies the chosen sentence transformer, all-MiniLM-L6-v2, to random 50-message batches of Reddit and Telegram data to flag any issues early, such as persistent false positives or false negatives.

In [None]:
#applying all mini lm to calculate the vectors on the climate misinformation example
all_Mini_example_embeddings = model_all_Mini.encode(misinformation_examples['Climate misinformation example'], convert_to_tensor=True)

In [None]:
telegram = pd.read_csv('/content/final_clean_telegram_ulez.csv')
reddit = pd.read_csv('/content/final_clean_reddit_ulez.csv')

The code below was applied to random samples of 50 messages drawn from anywhere within my datasets (ten are shown here as an example). By examining the closest taxonomy match for each sampled message, I was able to identify why some false positives and false negatives were occurring and adapt my method accordingly. False positives occurred when general climate discourse was assigned high scores. To address this, I wrote examples of neutral climate discourse and added them to the example dataset, so that a message is only flagged when it is closer to a misinformation example than to any neutral one. False negatives occurred because climate misinformation on Telegram often linked the loss of freedom to climate policy, a theme not yet reflected in my examples, so I added more examples expressing that sentiment. This process also confirmed that average similarity was of little help in my analysis: climate misinformation is so varied that even a genuine misinformation message is dissimilar to most of the examples, so the average flattens out the helpful information.
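The point about averages can be shown with a small numerical sketch. Both score lists below are hypothetical, invented only to illustrate the effect. The first imitates a message that strongly matches one taxonomy example, and the second imitates an off-topic message that is weakly similar to everything.

```python
from statistics import mean

# Hypothetical cosine scores of one message against six taxonomy examples:
# a strong match with a single example, near zero for the rest.
one_strong_match = [0.81, 0.12, 0.05, -0.02, 0.08, 0.10]

# Hypothetical scores for an off-topic message: uniformly low, no clear match.
no_clear_match = [0.20, 0.18, 0.19, 0.17, 0.21, 0.19]

for name, scores in [("one strong match", one_strong_match),
                     ("no clear match", no_clear_match)]:
    print(f"{name}: mean={mean(scores):.2f}, max={max(scores):.2f}")
```

The two lists share the same mean, so an average-based rule could not separate them, whereas the maximum differs sharply.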

As for the code, a for loop iterates over a specified slice of the dataset, computing the cosine similarity between each message and every climate misinformation example. The average similarity, the highest similarity and the example the message was matched to are printed for close inspection.

In [None]:
for statement in telegram['clean_message'][20000:20010]:
    statement_embedding = model_all_Mini.encode(statement, convert_to_tensor=True)
    cosine_scores = util.cos_sim(statement_embedding, all_Mini_example_embeddings)[0]

    avg_similarity = cosine_scores.mean().item()
    max_similarity = cosine_scores.max().item()
    max_idx = cosine_scores.argmax().item()
    best_match = misinformation_examples['Climate misinformation example'][max_idx]

    print(f"\nStatement: {statement}")
    print(f"Average Similarity: {avg_similarity:.4f}")
    print(f"Highest Similarity: {max_similarity:.4f}")
    print(f"Best Matching Taxonomy Sentence: '{best_match}'")


Statement: abomination in action by the metropolitan police arresting ulez protestors purely because the protest was near to khans house...this is war
Average Similarity: 0.0532
Highest Similarity: 0.2470
Best Matching Taxonomy Sentence: 'IPCC is alarmist'

Statement: secret courts, secret justice sysytem, unlawful prosecutions and unlawful convictions.. time to close down the justice system as it does not deal justice it deals summary judgement with no legal recourse..kangeroo courts are illegal in the uk as are secret unlawful procedures.. they run of risk of a backlash of epic proportions that would end this rule of law breach forever.. this is not acceptable..
Average Similarity: -0.0128
Highest Similarity: 0.3116
Best Matching Taxonomy Sentence: 'Net Zero laws will turn regular citizens into criminals overnight'

Statement: ministry if injustice more like this shit is entirely unconstitutionally unlawful
Average Similarity: 0.0063
Highest Similarity: 0.3202
Best Matching Taxonomy

## Applying the iterated taxonomy to the entire dataset

In [None]:
climate_misinfo_neutral_examples = pd.read_csv('/content/Final climate misinformation : neutral examples.csv')
telegram = pd.read_csv('/content/final_clean_telegram_ulez.csv')
reddit = pd.read_csv('/content/final_clean_reddit_ulez.csv')

In [None]:
#applying all mini lm to calculate the vectors on the climate misinformation and neutral discourse examples
all_example_embeddings = model_all_Mini.encode(climate_misinfo_neutral_examples['Climate misinformation/ neutral example'], convert_to_tensor=True)

The function below applies the sentence transformer model, all-MiniLM-L6-v2, to calculate the semantic similarity between every message in the DataFrame and the predefined climate misinformation and neutral examples. It first converts the text column into a list of strings and encodes each message as a dense vector. Lists are then initialised to store the results: the highest similarity score, the most similar example and the predicted class, which is either neutral or misinfo. Next, the cosine similarity between each message and every climate misinformation / neutral example is calculated, and the best match for each message is appended to these lists. Finally, the results are added to the DataFrame as new columns.

In [None]:
def compute_similarity(df, text_column):
    message_embeddings = model_all_Mini.encode(df[text_column].tolist(), convert_to_tensor=True)

    highest_semantic_similarity_score = []
    most_similar_phrases = []
    predicted_class = []

    for i, message_embedding in enumerate(message_embeddings):
        cosine_scores = util.cos_sim(message_embedding, all_example_embeddings)[0]

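        # use the tensor's own methods (as in misinformation_testing) so this also works when the embeddings are on a GPU
        best_idx = cosine_scores.argmax().item()
        best_score = cosine_scores[best_idx].item()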
        best_phrase = climate_misinfo_neutral_examples['Climate misinformation/ neutral example'][best_idx]
        best_label = climate_misinfo_neutral_examples['Climate misinformation / neutral'][best_idx]

        highest_semantic_similarity_score.append(best_score)
        most_similar_phrases.append(best_phrase)
        predicted_class.append(best_label)

    df['Highest semantic similarity score'] = highest_semantic_similarity_score
    df['Most similar phrase'] = most_similar_phrases
    df['Misinfo / Neutral'] = predicted_class

    return df


In [None]:
#applying the function to all the telegram data
telegram_semantic_similarity = compute_similarity(telegram, text_column='clean_message')

In [None]:
#applying the function to all the reddit data
reddit_semantic_similarity = compute_similarity(reddit, text_column='clean_message')

In [None]:
#labelling everything with a similarity score of under 0.55 neutral
reddit_semantic_similarity.loc[reddit_semantic_similarity['Highest semantic similarity score'] < 0.55, 'Misinfo / Neutral'] = 'neutral'
telegram_semantic_similarity.loc[telegram_semantic_similarity['Highest semantic similarity score'] < 0.55, 'Misinfo / Neutral'] = 'neutral'
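Taken together, `compute_similarity` and the 0.55 cut-off amount to a two-part decision rule: take the label of the nearest example, then fall back to neutral if the match is too weak. The sketch below restates that rule for a single message in plain Python. The function name `classify` and the toy scores are hypothetical.

```python
def classify(scores, labels, threshold=0.55):
    """Return (label, best_score) for one message.

    scores[i] is the similarity to example i, and labels[i] is that
    example's class ('misinfo' or 'neutral').
    """
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    best_score = scores[best_idx]
    # A weak best match is treated as neutral regardless of its label.
    label = labels[best_idx] if best_score >= threshold else "neutral"
    return label, best_score

# Hypothetical example labels and similarity scores.
labels = ["misinfo", "neutral", "misinfo"]
print(classify([0.62, 0.40, 0.30], labels))  # nearest is misinfo and strong
print(classify([0.50, 0.40, 0.30], labels))  # nearest is misinfo but weak
print(classify([0.60, 0.70, 0.20], labels))  # nearest is a neutral example
```

A message is therefore only flagged as misinfo when its closest example is a misinformation example and the score reaches the threshold.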

In [None]:
from google.colab import files
telegram_semantic_similarity.to_csv('telegram_sem_similarity.csv', index=False)
files.download('telegram_sem_similarity.csv')

reddit_semantic_similarity.to_csv('reddit_sem_similarity.csv', index=False)
files.download('reddit_sem_similarity.csv')

### Testing accuracy using F1 score

The function below first sorts the results by the highest semantic similarity score in descending order, splits the sorted rows into four equal-sized rank bands (quartiles) and then creates a new DataFrame with the top 50 messages from each quartile.

In [None]:
def top_n_from_quartiles(df, sort_by, n=50):
    df_sorted = df.sort_values(by=sort_by,  ascending=False).reset_index(drop=True)

    rows_per_quartile = len(df_sorted) // 4

    q1 = df_sorted.iloc[0:rows_per_quartile].head(n)
    q2 = df_sorted.iloc[rows_per_quartile:2*rows_per_quartile].head(n)
    q3 = df_sorted.iloc[2*rows_per_quartile:3*rows_per_quartile].head(n)
    q4 = df_sorted.iloc[3*rows_per_quartile:].head(n)

    return pd.concat([q1, q2, q3, q4], ignore_index=True)
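The slicing logic of `top_n_from_quartiles` is easier to follow on a handful of numbers. The sketch below mirrors it with plain lists instead of a DataFrame. The function name `top_n_per_quartile` and the ten toy scores are hypothetical.

```python
def top_n_per_quartile(scores, n=2):
    """Sort scores in descending order, cut them into four rank bands and
    keep the first n of each band (the last band absorbs any remainder)."""
    ranked = sorted(scores, reverse=True)
    size = len(ranked) // 4
    bounds = [(0, size), (size, 2 * size), (2 * size, 3 * size),
              (3 * size, len(ranked))]
    sample = []
    for start, stop in bounds:
        sample.extend(ranked[start:stop][:n])
    return sample

# Ten hypothetical similarity scores: 0.0, 0.1, ..., 0.9
toy_scores = [i / 10 for i in range(10)]
print(top_n_per_quartile(toy_scores, n=1))  # [0.9, 0.7, 0.5, 0.3]
```

Note that the sample takes the highest-scoring rows of each band, not a random draw from within it, which matters when interpreting the evaluation sample.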

In [None]:
#creating a new dataframe with the top 50 messages from every quartile
reddit_top50_pq = top_n_from_quartiles(reddit_semantic_similarity, sort_by='Highest semantic similarity score', n=50)
telegram_top50_pq = top_n_from_quartiles(telegram_semantic_similarity, sort_by='Highest semantic similarity score', n=50)

In [None]:
#creating a new column which assigns messages labelled misinfo 1 and those labelled neutral 0
reddit_top50_pq['Binary number'] = reddit_top50_pq['Misinfo / Neutral'].apply(lambda x: 1 if x == 'misinfo' else 0)
telegram_top50_pq['Binary number'] = telegram_top50_pq['Misinfo / Neutral'].apply(lambda x: 1 if x == 'misinfo' else 0)

In [None]:
reddit_top50_pq.to_csv('Reddit top 50 per quartile.csv', index=False)
files.download('Reddit top 50 per quartile.csv')

telegram_top50_pq.to_csv('Telegram top 50 per quartile.csv', index=False)
files.download('Telegram top 50 per quartile.csv')

After downloading the CSVs, I manually reviewed every message to verify whether it had been assigned the correct value and recorded the correct value in a new column, 'Real value'. I then concatenated the manually labelled Reddit and Telegram files (the top 50 messages per quartile from each platform) so that a single classification report could be produced.


In [None]:
r_labelled = pd.read_csv('/content/Reddit manually labelled top 50 per quartile.csv')
t_labelled = pd.read_csv('/content/Telegram manually labelled top 50 per quartile.csv')
combined_t_r_labelled = pd.concat([r_labelled, t_labelled], ignore_index=True)

Below, I evaluate the performance of the climate misinformation detection model. After importing the necessary function, I assign y_true the column I filled in manually, ‘Real value’, which holds the correct label for every message: 1 for misinformation and 0 for neutral. I assign y_pred the model-generated column, ‘Binary number’, again 1 for misinformation and 0 for neutral. Finally, I generate a classification report, which calculates and prints precision, recall, F1-score and support for each class (misinformation and neutral).

In [None]:
from sklearn.metrics import classification_report
y_true = combined_t_r_labelled['Real value']
y_pred = combined_t_r_labelled['Binary number']
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97       341
           1       0.93      0.73      0.82        59

    accuracy                           0.95       400
   macro avg       0.94      0.86      0.90       400
weighted avg       0.95      0.95      0.95       400
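To make the metrics in the report above concrete, the sketch below recomputes the misinformation-class figures from confusion-matrix counts. The counts are an assumption: they were back-calculated so that they reproduce the rounded report (43 true positives, 3 false positives, 16 false negatives, 338 true negatives) and are not taken from the notebook's data. The helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions for a single class."""
    precision = tp / (tp + fp)   # share of flagged messages that were correct
    recall = tp / (tp + fn)      # share of true misinformation that was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed counts, inferred from the rounded report rather than the raw labels.
tp, fp, fn, tn = 43, 3, 16, 338

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
```

Under these assumed counts, the lower recall for class 1 corresponds to 16 of the 59 manually labelled misinformation messages being missed, while only 3 neutral messages were wrongly flagged.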



## Comparative analysis

### How prevalent is climate misinformation across the two platforms?

In [None]:
reddit_ss = pd.read_csv('/content/FINAL reddit_sem_similarity (2).csv')
telegram_ss = pd.read_csv('/content/FINAL telegram_sem_similarity (2).csv')

In [None]:
#calculating the percentage of discourse that climate misinformation accounts for on each of the two platforms
misinfo_counts_r = reddit_ss['Misinfo / Neutral'].value_counts()
misinfo_count_r = misinfo_counts_r.get('misinfo', 0)
percent_misinfo_r = (misinfo_count_r / len(reddit_ss)) * 100

misinfo_counts_t = telegram_ss['Misinfo / Neutral'].value_counts()
misinfo_count_t = misinfo_counts_t.get('misinfo', 0)
percent_misinfo_t = (misinfo_count_t / len(telegram_ss)) * 100

In [None]:
platform_misinfo = pd.DataFrame({'Platform': ['Reddit', 'Telegram'], '% Climate misinfo': [percent_misinfo_r, percent_misinfo_t]})
fig = px.bar(platform_misinfo, x='Platform', y='% Climate misinfo', title = '% of climate misinformation within the discourse across platforms')
fig.show()

### Who spreads misinformation and how often?

In [None]:
reddit_spreaders = reddit_ss[reddit_ss['Misinfo / Neutral'] == 'misinfo']['hashed_user'].value_counts()
telegram_spreaders = telegram_ss[telegram_ss['Misinfo / Neutral'] == 'misinfo']['hashed_user'].value_counts()

The code below visualises how much climate misinformation the top 20 spreaders posted on Reddit and on Telegram. It first sorts each platform's per-user counts in descending order and selects the top twenty using head(). These are saved in a new DataFrame, which is then plotted with plotly.


In [None]:
top_reddit_spreaders = reddit_spreaders.sort_values(ascending=False).head(20)
reddit_df = pd.DataFrame({
    'Count': top_reddit_spreaders.values
})

fig_reddit = px.bar(reddit_df,
                    y='Count',
                    title='Top 20 Users Spreading Climate Misinfo on Reddit',
                    labels={'Count': 'No. of Climate Misinfo Messages'},
                    color_discrete_sequence=['darkblue'])
fig_reddit.show()

top_telegram_spreaders = telegram_spreaders.sort_values(ascending=False).head(20)
telegram_df = pd.DataFrame({
    'Count': top_telegram_spreaders.values
})

fig_telegram = px.bar(telegram_df,
                      y='Count',
                      title='Top 20 Users Spreading Climate Misinfo on Telegram',
                      labels={'Count': 'No. of Climate Misinfo Messages'},
                      color_discrete_sequence=['darkblue'])
fig_telegram.show()

In [None]:
#calculating summary statistics that show how concentrated climate misinformation is among the top spreaders on each platform
misinfo_reddit = reddit_ss[reddit_ss['Misinfo / Neutral'] == 'misinfo']
misinfo_telegram = telegram_ss[telegram_ss['Misinfo / Neutral'] == 'misinfo']

top_reddit_spreader_count = reddit_spreaders.iloc[0]
top_telegram_spreader_count = telegram_spreaders.iloc[0:3]

total_reddit_misinfo_count = len(misinfo_reddit)
total_telegram_misinfo_count = len(misinfo_telegram)

percentage_misinfo_top_reddit_spreader = (top_reddit_spreader_count / total_reddit_misinfo_count) * 100
percentage_misinfo_top_telegram_spreaders = (top_telegram_spreader_count / total_telegram_misinfo_count) * 100

unique_misinfo_users_r = misinfo_reddit['hashed_user'].nunique()
unique_misinfo_users_t = misinfo_telegram['hashed_user'].nunique()

unique_ulez_users_r = reddit_ss['hashed_user'].nunique()
unique_ulez_users_t = telegram_ss['hashed_user'].nunique()


print(f"The single top climate misinformation spreader on Reddit contributes to {percentage_misinfo_top_reddit_spreader:.2f}% "
      f"of climate misinformation on the platform, out of {unique_misinfo_users_r} users who spread climate misinformation "
      f"and {unique_ulez_users_r} users who posted about ULEZ overall.")

print(f"The 6 top climate misinformation spreaders on Telegram contributes to {percentage_misinfo_top_telegram_spreaders.sum():.2f}% "
      f"of climate misinformation on the platform, out of {unique_misinfo_users_t} users who spread climate misinformation "
      f"and {unique_ulez_users_t} users who posted about ULEZ overall.")

The single top climate misinformation spreader on Reddit contributes to 15.62% of climate misinformation on the platform, out of 248 users who spread climate misinformation and 11792 users who posted about ULEZ overall.
The 6 top climate misinformation spreaders on Telegram contributes to 34.56% of climate misinformation on the platform, out of 96 users who spread climate misinformation and 943 users who posted about ULEZ overall.
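The concentration statistic reported above (the share of all misinformation messages posted by the top k users) can be sketched with the standard library alone. The function name `top_k_share` and the toy author list are hypothetical, and `Counter.most_common` plays the role that `value_counts()` plays in the notebook.

```python
from collections import Counter

def top_k_share(authors, k):
    """Percentage of messages posted by the k most active authors.

    `authors` holds one entry per misinformation message.
    """
    counts = Counter(authors)
    top_total = sum(count for _, count in counts.most_common(k))
    return 100 * top_total / len(authors)

# Hypothetical data: ten misinformation messages from four users.
toy_authors = ["u1"] * 5 + ["u2"] * 3 + ["u3", "u4"]
print(top_k_share(toy_authors, k=1))  # 50.0
print(top_k_share(toy_authors, k=2))  # 80.0
```

Passing k as a parameter keeps the number of users being summed and the number quoted in the printed sentence in step with each other.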


### What subreddits / Telegram channels are misinfo hotbeds?

In [None]:
# Calculating the percentage of climate misinformation on each subreddit
subreddit_spreaders = misinfo_reddit['subreddit'].value_counts()
subreddit_freq = reddit_ss['subreddit'].value_counts()

common_subreddits = subreddit_spreaders.index.intersection(subreddit_freq.index)
subreddit_misinfo_pct = (subreddit_spreaders[common_subreddits] / subreddit_freq[common_subreddits]) * 100
subreddit_misinfo_pct = subreddit_misinfo_pct.sort_values(ascending=False)
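The per-subreddit percentage is a grouped rate: misinformation messages in a group divided by all messages in that group. The sketch below computes the same quantity in plain Python on hypothetical rows. The function name `misinfo_rate_by_group` is an assumption. Unlike the intersection-based code above, it also reports groups with no misinformation at 0%.

```python
from collections import defaultdict

def misinfo_rate_by_group(rows):
    """rows: iterable of (group, label) pairs, label 'misinfo' or 'neutral'.
    Returns {group: percentage of that group's messages labelled misinfo}."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for group, label in rows:
        totals[group] += 1
        if label == "misinfo":
            flagged[group] += 1
    return {group: 100 * flagged[group] / totals[group] for group in totals}

# Hypothetical messages from two groups.
toy_rows = [("a", "misinfo"), ("a", "neutral"), ("b", "neutral"),
            ("a", "neutral"), ("a", "misinfo")]
rates = misinfo_rate_by_group(toy_rows)
print(rates)
```

In pandas, a comparable result could be obtained by taking the mean of a boolean misinfo flag within each group, for example with `groupby(...).mean()`.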

In [None]:
#using plotly to visualise the results
df_subreddit_misinfo = subreddit_misinfo_pct.reset_index()
df_subreddit_misinfo.columns = ['Subreddit', 'Percentage of Climate Misinfo Posts']

fig = px.bar(df_subreddit_misinfo,
             x='Subreddit', y='Percentage of Climate Misinfo Posts',
             title='Percentage of Climate Misinfo Posts Across Subreddits',
             labels={'Percentage of Climate Misinfo Posts': '% of Posts'},
             color_discrete_sequence=['indianred'])

fig.update_layout(xaxis_tickangle=-45)
fig.show()

In [None]:
# Calculating the percentage of climate misinformation on each telegram channel
telegram_c_spreaders = misinfo_telegram['group_name'].value_counts()
t_c_freq = telegram_ss['group_name'].value_counts()

common_channels = telegram_c_spreaders.index.intersection(t_c_freq.index)
channel_misinfo_pct = (telegram_c_spreaders[common_channels] / t_c_freq[common_channels]) * 100
channel_misinfo_pct = channel_misinfo_pct.sort_values(ascending=False)

In [None]:
#using plotly to visualise the results (largely the same as above, except that a helper function inserts a line break
#after roughly the first third of the words in each channel name, because the long names would otherwise be unreadable on the graph)

df_channel_misinfo = channel_misinfo_pct.reset_index()
df_channel_misinfo.columns = ['Channel', 'Percentage of Climate Misinfo Posts']
def split_after_first_third(text):
    words = text.split()
    break_at = math.ceil(len(words) / 3)
    return ' '.join(words[:break_at]) + '<br>' + ' '.join(words[break_at:])

df_channel_misinfo['Channel_wrapped'] = df_channel_misinfo['Channel'].apply(split_after_first_third)

fig = px.bar(df_channel_misinfo,
             x='Channel_wrapped', y='Percentage of Climate Misinfo Posts',
             title='Percentage of Climate Misinfo Posts Across Telegram Channels',
             labels={'Percentage of Climate Misinfo Posts': '% of Posts'},
             color_discrete_sequence=['indianred'])

fig.update_layout(
    xaxis_tickfont=dict(size=7),
    title_font_size=18,
    yaxis_title='% of Posts',
    xaxis_title='Channel',
)
fig.show()