# CS450.4 Final Project - adhavle - Classifying Partisan Bias in News Articles
- I am attempting to replicate some of the methods used in [Classifying Partisan Bias in News Articles: Leveraging an Understanding of Political Language and Article Structure](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/final-reports/final-report-169502805.pdf), which uses a dataset from a competition on detecting hyperpartisan and fake news.
- Dataset location: [Data for PAN at SemEval 2019 Task 4: Hyperpartisan News Detection](https://zenodo.org/records/1489920)
- Also see [Hyperpartisan News Detection 2019](https://pan.webis.de/semeval19/semeval19-web/#data) and [SemEval-2019 Task 4: Hyperpartisan News Detection](https://aclanthology.org/S19-2145.pdf)


In [1]:
import os
import time
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
# set data locations (despite the variable name, training_data_file points at the by-publisher validation split)
dataset_dir = os.path.join(os.getcwd(), "dataset")
training_data_file = os.path.join(dataset_dir, "articles-validation-bypublisher-20181122-html-escaped.xml")
target_data_file = os.path.join(dataset_dir, "ground-truth-validation-bypublisher-20181122.xml")

In [9]:
# articles and labels live in separate XML files
datadf = pd.read_xml(path_or_buffer = training_data_file)
targetdf = pd.read_xml(path_or_buffer = target_data_file)
# pair rows by position; the id == id2 check in the next cell confirms the pairing
df = pd.concat([datadf, targetdf], axis = 1)
df.columns = ['id', 'published-at', 'title', 'article', 'id2', 'hyperpartisan', 'bias',
              'url', 'labeled-by']

In [10]:
# run some test cases to ensure the data is good
if len(df) == 150_000:
    print(f"PASS: dataframe has {len(df)} records as expected")
else:
    print(f"FAIL: dataframe has {len(df)} records - expected 150,000")

def validate_column_does_not_have_null_values(column_name):
    if df[column_name].isnull().sum() == 0:
        print(f"PASS: no null values detected for column '{column_name}'")
    else:
        print(f"FAIL: {df[column_name].isnull().sum()} null values for column '{column_name}' not expected")

validate_column_does_not_have_null_values("id")
validate_column_does_not_have_null_values("id2")
validate_column_does_not_have_null_values("article")
validate_column_does_not_have_null_values("bias")

id_matches = df['id'] == df['id2']
n_not_matched = id_matches.value_counts().get(False, 0)
n_matched = id_matches.value_counts().get(True, 0)
if n_matched == 150_000 and n_not_matched == 0:
    print("PASS: all article IDs from training file and target file matched (id == id2 for all records)")
else:
    print(f"FAIL: {n_matched} article IDs from training file matched, BUT {n_not_matched} article IDs did not match")

PASS: dataframe has 150000 records as expected
PASS: no null values detected for column 'id'
PASS: no null values detected for column 'id2'
PASS: no null values detected for column 'article'
PASS: no null values detected for column 'bias'
PASS: all article IDs from training file and target file matched (id == id2 for all records)


### Comments
The dataset is loaded together with its target values. The `id2` column has served its purpose, which was to confirm that each article row lines up with its target row, and can be dropped going forward.
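
The positional pairing checked above could also be expressed as a key-based join, which raises an error if an ID is duplicated and drops rows whose ID has no partner. This is a minimal sketch, not the notebook's method: the two small frames and their values are made up for illustration, and only the `id` and `bias` column names come from the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the articles file and the ground-truth file.
articles = pd.DataFrame({"id": [1, 2, 3], "article": ["a", "b", "c"]})
labels = pd.DataFrame({"id": [3, 1, 2], "bias": ["right", "left", "least"]})

# validate="one_to_one" raises pandas.errors.MergeError if an id repeats on either side;
# how="inner" keeps only ids that are present in both frames.
merged = articles.merge(labels, on="id", how="inner", validate="one_to_one")

# A length check then confirms that no article lost its label.
all_matched = len(merged) == len(articles) == len(labels)
print(merged)
print("all matched:", all_matched)
```

Joining on the key does not depend on both files listing the articles in the same order, so no separate `id2` column is needed.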

In [11]:
df = df.drop('id2', axis=1)
df.info()
df.tail()  # a cell displays only its last expression, so a preceding df.head() would show nothing

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   id             150000 non-null  int64 
 1   published-at   100492 non-null  object
 2   title          137723 non-null  object
 3   article        150000 non-null  object
 4   hyperpartisan  150000 non-null  bool  
 5   bias           150000 non-null  object
 6   url            150000 non-null  object
 7   labeled-by     150000 non-null  object
dtypes: bool(1), int64(1), object(6)
memory usage: 8.2+ MB


|  | id | published-at | title | article | hyperpartisan | bias | url | labeled-by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 149995 | 1494825 |  |  | `<p>By Andrew Osborn</p> \n<p>MOSCOW (Reuters) ...` | True | left | http://politicususa.com/2017/10/04/russia-thro... | publisher |
| 149996 | 1494857 |  | I Now Pronounce You Spouse and Spouse | `<p></p> \n<p>In keeping with its reputation of...` | True | right | http://barbwire.com/2014/07/14/now-pronounce-s... | publisher |
| 149997 | 1494877 | 2016-03-15 | It's now clear that only a Democrat can stop D... | `<p><a href="" type="internal">Donald Trump's</...` | True | left | https://vox.com/2016/3/1/11144320/super-tuesda... | publisher |
| 149998 | 1494883 | 2016-02-28 | The Liberal Redneck: 'My proudest moment as a ... | `<p></p> \n<p>LR the Liberal Redneck here, comi...` | True | left | http://americannewsx.com/politics/liberal-redn... | publisher |
| 149999 | 1494893 |  | Obama’s Victory: Fourth Global Press Roundup | `<p></p> \n<p></p> \n<p>From <a href="http://ww...` | False | least | http://themoderatevoice.com/obamas-victory-fou... | publisher |
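
The `article` values shown above still contain HTML markup such as `<p>` and `<a href=...>`, which a text vectorizer would see as part of the text. One possible cleanup step uses only the standard library, as sketched below. The helper name `strip_html` is hypothetical, and the sample string is modeled on the first row above with an invented ending.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment: str) -> str:
    """Return the visible text of `fragment` with runs of whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.parts).split())


# Illustrative input only; the "&amp; more" ending is invented.
sample = "<p>By Andrew Osborn</p> \n<p>MOSCOW (Reuters) &amp; more</p>"
print(strip_html(sample))
```

In the notebook this could be applied with something like `df['article'].map(strip_html)` before vectorizing. Whether it improves the classifier is untested here.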


In [None]:
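# TF-IDF features over the article text, capped at the 5,000 most frequent terms
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words='english', # see https://aclanthology.org/W18-2502/
    max_features=5000,
    min_df=5,    # ignore terms that appear in fewer than 5 articles
    max_df=0.7)  # ignore terms that appear in more than 70% of articles


seconds_start_time = time.perf_counter() # TODO: replace with timer.start and timer.end methods

# despite the name, this holds TF-IDF weights rather than raw word counts
bag_of_words = vectorizer.fit_transform(df['article'])

print(f"vectorizer.fit_transform took {time.perf_counter() - seconds_start_time:.1f} seconds")

# note: toarray() materializes a dense 150,000 x 5,000 array in memory
bag_of_words_df = pd.DataFrame(
    bag_of_words.toarray(),
    columns=vectorizer.get_feature_names_out())

bag_of_words_df.head()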
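
The cell above converts the TF-IDF output to a dense DataFrame before training. `TfidfVectorizer.fit_transform` returns a SciPy sparse matrix, and both `train_test_split` and `LogisticRegression` accept sparse input, so the dense step could be skipped. The sketch below shows that on a toy corpus; the texts and the `class_a` / `class_b` labels are made up and stand in for `df['article']` and `df['bias']`.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up corpus and labels, for illustration only.
texts = [
    "rain is expected across the region tonight",
    "heavy snow closed several mountain roads",
    "sunny skies return after the weekend storm",
    "strong winds brought down power lines",
    "the home team won the final match",
    "the striker scored twice in extra time",
    "the coach praised the young goalkeeper",
    "fans celebrated the championship title",
]
labels = ["class_a"] * 4 + ["class_b"] * 4

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
features = vectorizer.fit_transform(texts)  # stays sparse, never densified

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels)

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)  # sparse input is supported directly

# Size of a dense float64 array with the notebook's shape (150,000 x 5,000).
dense_gib = 150_000 * 5_000 * 8 / 2**30
print(f"dense array would need about {dense_gib:.1f} GiB")
```

The trade-off is that the sparse matrix has no column labels, so feature names have to be looked up through `vectorizer.get_feature_names_out()` when needed.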

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

x_train, x_test, y_train, y_test = train_test_split(bag_of_words_df, 
                                                    df['bias'], 
                                                    test_size=0.2, 
                                                    random_state=42, 
                                                    stratify=df['bias'])

In [None]:
# https://stackoverflow.com/questions/62658215/convergencewarning-lbfgs-failed-to-converge-status-1-stop-total-no-of-iter
# https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-definitions/52388406#52388406
# https://forecastegy.com/posts/how-to-solve-logistic-regression-not-converging-in-scikit-learn/
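# max_iter is raised from the default of 100; the links above discuss lbfgs convergence warnings
model = LogisticRegression(max_iter=1000)

seconds_start_time = time.perf_counter() # TODO: replace with timer.start and timer.end methods
model.fit(x_train, y_train)
print(f"model.fit took {time.perf_counter() - seconds_start_time:.1f} seconds")

# mean accuracy on the held-out 20% test split
model.score(x_test, y_test)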
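
The timing code in the cells above is marked for replacement with a timer helper. One way to do that is a small context manager, sketched below using only the standard library. The name `timer` and its output format are assumptions, not something the notebook defines.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label):
    """Print how long the enclosed block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label} took {elapsed:.2f} seconds")


# Usage: the sleep stands in for a slow call such as model.fit(x_train, y_train).
with timer("example step"):
    time.sleep(0.1)
```

`time.perf_counter()` is a monotonic clock intended for measuring intervals, which makes it a safer choice than `time.time()` if the system clock is adjusted mid-run.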

In [None]:
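# spot-check the first 100 test rows: predict them in one batch, then compare to the actual labels
sample = x_test.iloc[:100]
predicted = model.predict(sample)
probabilities = model.predict_proba(sample)
actual = y_test.iloc[:100].values
for i, (p, prb, a) in enumerate(zip(predicted, probabilities, actual)):
    matched = "matched" if p == a else "NOT MATCHED"
    # print(f"{i} - {prb} predicted {p} - actual {a} - {matched}")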
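
The spot check above compares predictions with actual labels one row at a time. The same comparison can be aggregated into an accuracy figure and a confusion matrix, which shows which bias classes get confused with each other. In this sketch the `actual` and `predicted` lists are made up and stand in for `y_test` and `model.predict(x_test)`; the five label names are the ones used in the notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

bias_labels = ["left", "left-center", "least", "right-center", "right"]

# Made-up labels, for illustration only.
actual = ["left", "left", "least", "right", "right",
          "least", "left-center", "right-center"]
predicted = ["left", "least", "least", "right", "right-center",
             "least", "left-center", "right-center"]

accuracy = accuracy_score(actual, predicted)
# Rows are actual labels, columns are predicted labels, both in bias_labels order.
matrix = confusion_matrix(actual, predicted, labels=bias_labels)

print(f"accuracy: {accuracy:.2f}")
for label, row in zip(bias_labels, matrix):
    print(f"{label:>12}: {row.tolist()}")
```

Passing `labels=bias_labels` fixes the row and column order, so the matrix reads from left to right along the political spectrum instead of alphabetically.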

In [None]:
import matplotlib.pyplot as plt
x_train.head()

In [None]:
base_palette = sns.color_palette("Paired")
bias_palette = [
    base_palette[1], # blue (left)
    base_palette[0], # light blue (left-center)
    base_palette[8], # light purple (least)
    base_palette[4], # light red (right-center)
    base_palette[5], # red (right)
]

data_distribution_chart = sns.displot(
    data = df,
    x = df.index,
    hue = "bias",
    multiple = "stack",
    height = 3,
    aspect = 3,
    hue_order = ['left', 'left-center', 'least', 'right-center', 'right'],
    palette = bias_palette)

data_distribution_chart.set_xlabels("article #")
data_distribution_chart.set_ylabels("article count") # the y-axis of a histogram is a count per bin

plt.show()
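
The chart above bins articles by row index and stacks the counts per bias label. A numeric companion could tabulate the same counts, overall and per block of rows, as in the sketch below. The ten-row frame and the block size of 5 are made up for illustration; on the real data the block size would be much larger.

```python
import pandas as pd

# Made-up labels standing in for df['bias'].
toy = pd.DataFrame({"bias": ["left", "left", "right", "right", "least",
                             "least", "left", "right", "least", "left-center"]})

# Overall class balance.
overall = toy["bias"].value_counts()

# Counts per block of consecutive rows, mirroring the histogram bins.
toy["block"] = toy.index // 5
by_block = pd.crosstab(toy["block"], toy["bias"])

print(overall)
print(by_block)
```

If the per-block counts differ sharply from the overall balance, the file is ordered by label or publisher, which is one reason the notebook's `train_test_split` call stratifies on `bias` rather than taking a contiguous slice.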