Summary

The provided code script reads text data from different categories (business, entertainment, politics, sports, tech) and their corresponding summaries. It prepares DataFrames to organize the text and summary data. The script then performs fuzzy text matching between the text and summaries, merging them into a single DataFrame (`merged_df`) for matched pairs. It also reads external data from CSV files and stores unique class values. Overall, the script performs data preparation, text matching, and data merging tasks for further analysis.

In [1]:
import numpy as np 
import pandas as pd 
import os
import glob
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [2]:
business = []
entertainment = []
politics = []
sport = []
tech = []

label_list = []
text_list = []

In [3]:
for category in os.listdir('../data/BBC_News_Summary/News_Articles/'):
    if category[0]!= '.':
        for text in os.listdir(f'../data/BBC_News_Summary/News_Articles/{category}'):
            print(text,category)
            encoding = "utf-8-sig"  # Specify the desired encoding
            try:
                with open(f'../data/BBC_News_Summary/News_Articles/{category}/{text}', 'r', encoding=encoding) as file:
                    text = file.read()
                    print(text)
                    if category == 'business':
                        business.append(f'{text}')
                    elif category == 'entertainment':
                        entertainment.append(f'{text}')
                    elif category == 'politics':
                        politics.append(f'{text}')
                    elif category == 'sport':
                        sport.append(f'{text}')
                    elif category == 'tech':
                        tech.append(f'{text}')
            except UnicodeDecodeError:
                with open(f'../data/BBC_News_Summary/News_Articles/{category}/{text}', 'r', encoding="latin-1") as file:
                    text = file.read()
                    print(text)
                    text = file.read()
                    print(text)
                    if category == 'business':
                        business.append(f'{text}')
                    elif category == 'entertainment':
                        entertainment.append(f'{text}')
                    elif category == 'politics':
                        politics.append(f'{text}')
                    elif category == 'sport':
                        sport.append(f'{text}')
                    elif category == 'tech':
                        tech.append(f'{text}')
    else:
        continue

289.txt entertainment
Musicians to tackle US red tape

Musicians' groups are to tackle US visa regulations which are blamed for hindering British acts' chances of succeeding across the Atlantic.

A singer hoping to perform in the US can expect to pay $1,300 (£680) simply for obtaining a visa. Groups including the Musicians' Union are calling for an end to the "raw deal" faced by British performers. US acts are not faced with comparable expense and bureaucracy when visiting the UK for promotional purposes.

Nigel McCune from the Musicians' Union said British musicians are "disadvantaged" compared to their US counterparts. A sponsor has to make a petition on their behalf, which is a form amounting to nearly 30 pages, while musicians face tougher regulations than athletes and journalists. "If you make a mistake on your form, you risk a five-year ban and thus the ability to further your career," says Mr McCune.

"The US is the world's biggest music market, which means something has to be d

In [5]:
list_data_each = [business,entertainment,politics,sport,tech]
transposed_list_each = list(map(list, zip(*list_data_each)))
list_data_all = [text_list,label_list]
transposed_list_all = list(map(list, zip(*list_data_all)))
df_each = pd.DataFrame(transposed_list_each, columns=['business', 'entertainment', 'politics','sport','tech'])
df_all = pd.DataFrame(transposed_list_all, columns=['text','label'])

In [6]:
final_df = pd.DataFrame(columns=['Text', 'Category'])

for column in df_each.columns:
    column_content = df_each[column]
    
    df_temp = pd.DataFrame({'Text': column_content, 'Category': column})
    
    final_df = pd.concat([final_df, df_temp], ignore_index=True)

print(final_df)



                                                   Text  Category
0     UK economy facing 'major risks'\n\nThe UK manu...  business
1     Aids and climate top Davos agenda\n\nClimate c...  business
2     Asian quake hits European shares\n\nShares in ...  business
3     India power shares jump on debut\n\nShares in ...  business
4     Lacroix label bought by US firm\n\nLuxury good...  business
...                                                 ...       ...
1925  Cheaper chip for mobiles\n\nA mobile phone chi...      tech
1926  Progress on new internet domains\n\nBy early 2...      tech
1927  Slim PlayStation triples sales\n\nSony PlaySta...      tech
1928  Loyalty cards idea for TV addicts\n\nViewers c...      tech
1929  Apple iPod family expands market\n\nApple has ...      tech

[1930 rows x 2 columns]


In [7]:
import os

def count_text_files(folder_path):
    count = 0
    if not os.path.exists(folder_path):
        print(f"Folder '{folder_path}' does not exist.")
        return count
    
    subfolders = ["business", "entertainment", "politics", "sports", "tech"]
    
    for subfolder in subfolders:
        subfolder_path = os.path.join(folder_path, subfolder)
        if not os.path.exists(subfolder_path):
            print(f"Subfolder '{subfolder_path}' does not exist.")
            continue
        
        for file_name in os.listdir(subfolder_path):
            file_path = os.path.join(subfolder_path, file_name)
            if file_name.endswith(".txt") and os.path.isfile(file_path):
                count += 1
    
    return count


In [8]:
folder_path = "/Users/bgrnaymane/Documents/GitHub/Projektrealisierung_Gruppe5/data/BBC_News_Summary/News_Articles"
text_file_count = count_text_files(folder_path)
print("Number of text files:", text_file_count)

Subfolder '/Users/bgrnaymane/Documents/GitHub/Projektrealisierung_Gruppe5/data/BBC_News_Summary/News_Articles/sports' does not exist.
Number of text files: 1714


In [9]:
final_df.head()
final_df_random = final_df.sample(frac=1, random_state=42).reset_index(drop=True)

In [10]:
final_df

Unnamed: 0,Text,Category
0,UK economy facing 'major risks'\n\nThe UK manu...,business
1,Aids and climate top Davos agenda\n\nClimate c...,business
2,Asian quake hits European shares\n\nShares in ...,business
3,India power shares jump on debut\n\nShares in ...,business
4,Lacroix label bought by US firm\n\nLuxury good...,business
...,...,...
1925,Cheaper chip for mobiles\n\nA mobile phone chi...,tech
1926,Progress on new internet domains\n\nBy early 2...,tech
1927,Slim PlayStation triples sales\n\nSony PlaySta...,tech
1928,Loyalty cards idea for TV addicts\n\nViewers c...,tech


In [11]:
business_sum = []
entertainment_sum = []
politics_sum = []
sport_sum = []
tech_sum = []

category_list_sum = []
summary_list_sum = []

In [12]:
for category in os.listdir('../data/BBC_News_Summary/Summaries/'):
    if category[0]!= '.':
        for text in os.listdir(f'../data/BBC_News_Summary/Summaries/{category}'):
            print(text,category)
            encoding = "utf-8-sig"  # Specify the desired encoding
            try:
                with open(f'../data/BBC_News_Summary/Summaries/{category}/{text}', 'r', encoding=encoding) as file:
                    text = file.read()
                    print(text)
                    if category == 'business':
                        business_sum.append(f'{text}')
                    elif category == 'entertainment':
                        entertainment_sum.append(f'{text}')
                    elif category == 'politics':
                        politics_sum.append(f'{text}')
                    elif category == 'sport':
                        sport_sum.append(f'{text}')
                    elif category == 'tech':
                        tech_sum.append(f'{text}')
            except UnicodeDecodeError:
                with open(f'../data/BBC_News_Summary/Summaries/{category}/{text}', 'r', encoding="latin-1") as file:
                    text = file.read()
                    print(text)
                    text = file.read()
                    print(text)
                    if category == 'business':
                        business_sum.append(f'{text}')
                    elif category == 'entertainment':
                        entertainment_sum.append(f'{text}')
                    elif category == 'politics':
                        politics_sum.append(f'{text}')
                    elif category == 'sport':
                        sport_sum.append(f'{text}')
                    elif category == 'tech':
                        tech_sum.append(f'{text}')
    else:
        continue

289.txt entertainment
Nigel McCune from the Musicians' Union said British musicians are "disadvantaged" compared to their US counterparts.A US Embassy spokesman said: "We are aware that entertainers require visas for time-specific visas and are doing everything we can to process those applications speedily."The Musicians' Union stance is being endorsed by the Music Managers' Forum (MMF), who say British artists face "an uphill struggle" to succeed in the US, thanks to the tough visa requirements, which are also seen as impractical.Musicians' groups are to tackle US visa regulations which are blamed for hindering British acts' chances of succeeding across the Atlantic."The US is the world's biggest music market, which means something has to be done about the creaky bureaucracy," says Mr McCune."The current situation is preventing British acts from maintaining momentum and developing in the US," he added.A singer hoping to perform in the US can expect to pay $1,300 (£680) simply for obta

In [14]:
list_data_each_sum = [business_sum,entertainment_sum,politics_sum,sport_sum,tech_sum]
transposed_list_each_sum = list(map(list, zip(*list_data_each_sum)))
list_data_all_sum = [summary_list_sum,category_list_sum]
transposed_list_all_sum = list(map(list, zip(*list_data_all_sum)))
df_each_sum = pd.DataFrame(transposed_list_each_sum, columns=['business', 'entertainment', 'politics','sport','tech'])
df_all_sum = pd.DataFrame(transposed_list_all_sum, columns=['summary','category'])

In [None]:
import pandas as pd

folder_path = '/Users/bgrnaymane/Documents/GitHub/Projektrealisierung_Gruppe5/data/BBC_News_Summary/Merged_df'

file_name = 'merged_df.csv'

file_path = folder_path + '/' + file_name

final_df.to_csv(file_path, index=False)

print(f"DataFrame saved as CSV file at: {file_path}")

In [16]:
final_df_sum = pd.DataFrame(columns=['Summary', 'Category'])

for column in df_each_sum.columns:
    column_content = df_each_sum[column]
    
    df_temp = pd.DataFrame({'Summary': column_content, 'Category': column})
    
    final_df_sum = pd.concat([final_df_sum, df_temp], ignore_index=True)

print(final_df_sum)

                                                Summary  Category
0     "Despite some positive news for the export sec...  business
1     At the same time, about 100,000 people are exp...  business
2     The unfolding scale of the disaster in south A...  business
3     Shares in India's largest power producer, Nati...  business
4     LVMH said the French designer's haute couture ...  business
...                                                 ...       ...
1925  Texas, which makes computer chips for more tha...      tech
1926  By early 2005 the net could have two new domai...      tech
1927  The title broke the UK sales record for video ...      tech
1928  Viewers could soon be rewarded for watching TV...      tech
1929  The IFPI industry body said that the popularit...      tech

[1930 rows x 2 columns]


In [20]:
final_df_sum.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1930 entries, 0 to 1929
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Summary   1930 non-null   object
 1   Category  1930 non-null   object
dtypes: object(2)
memory usage: 30.3+ KB


In [19]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1930 entries, 0 to 1929
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Text      1930 non-null   object
 1   Category  1930 non-null   object
dtypes: object(2)
memory usage: 30.3+ KB


In [29]:
from fuzzywuzzy import fuzz

merged_df = pd.DataFrame(columns=['text', 'summary', 'category'])

for idx, row1 in final_df.iterrows():
    text = row1['Text']
    category = row1['Category']
    
    for _, row2 in final_df_sum.iterrows():
        summary = row2['Summary']
        column = row2['Category']
        
        similarity = fuzz.token_set_ratio(text, summary)
        
        if similarity >= 80:  
            merged_df = pd.concat([merged_df, pd.DataFrame({'text': [text], 'summary': [summary], 'category': [category]})])

print(merged_df)


KeyboardInterrupt: 

In [32]:
test = pd.read_csv('../data/Blogs_result/dataset.csv')
test.head()
unique_values = test['Class'].unique()
print(unique_values)


['Political speech' 'News' 'Jurisdiction' 'Literature' 'Blog']


In [None]:
test = pd.read_csv('../data/BBC_News_Summary/Merged_df/merged_df.csv')