### PMI scores and POS tagging, and combining the two
In the following I first calculate PMI scores. These calculations use code from Pew Research Center's `pewanalytics` package. I copy the relevant functions below; for their full code, see the Python file `mutual_info.py`. The functions are documented here: https://pewresearch.github.io/pewanalytics/examples.html#mutual-information.

I first do this on the full texts of the dataset, and then repeat it with stop words removed as a robustness check.
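
As a rough illustration of the quantity being computed, the sketch below works out a normalized PMI score from raw counts: the log ratio of the joint probability to the product of the marginals, divided by the negative log of the joint probability. It mirrors the normalization in the copied `compute_mutual_info` function, but the helper name `npmi` and the toy counts are assumptions for illustration, not part of the Pew code or of my data.

```python
import math


def npmi(n_term_in_group, n_term_total, n_group, n_total):
    """Normalized PMI (log base 2) between a term and a group, from raw counts."""
    p_xy = n_term_in_group / n_total  # joint: term occurs and article is in the group
    p_x = n_term_total / n_total      # marginal: term occurs
    p_y = n_group / n_total           # marginal: article is in the group
    if p_xy == 0:
        # The copied function ends up dividing 0 by 0 here, which gives NaN
        return math.nan
    pmi = math.log2(p_xy / (p_x * p_y))
    return pmi / -math.log2(p_xy)


# Toy counts: 8 articles, 4 in the group, a term seen twice, both times in the group
print(npmi(n_term_in_group=2, n_term_total=2, n_group=4, n_total=8))  # 0.5
```

Positive values indicate that the term appears in the group more often than independence would predict, negative values less often, and zero corresponds to independence.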

I then apply a BERT part-of-speech tagger from Hugging Face to the top 5,000 words associated with women's football articles and with men's football articles: https://huggingface.co/tasks/token-classification.

This code was run on Kaggle because of the greater computational power available there, hence the references to Kaggle.



In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

emmastoklee_liwc_analysis_men_csv_path = kagglehub.dataset_download('emmastoklee/liwc-analysis-men-csv')
emmastoklee_pmi_scores_path = kagglehub.dataset_download('emmastoklee/pmi-scores')

print('Data source import complete.')


In [None]:
import pandas as pd
import math
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from pandas import json_normalize
import json
from scipy.sparse import csr_matrix


In [None]:
def is_null(val, empty_lists_are_null=False, custom_nulls=None):

    """
    Returns the opposite of the outcome of :py:func:`pewtils.is_not_null`. The following values are always \
    considered null: ``numpy.nan, None, "None", "nan", "", " ", "NaN", "none", "n/a", "NONE", "N/A"``

    :param val: The value to check
    :param empty_lists_are_null: Whether or not an empty list or :py:class:`pandas.DataFrame` should be considered \
    null (default=False)
    :type empty_lists_are_null: bool
    :param custom_nulls: an optional list of additional values to consider as null
    :type custom_nulls: list
    :return: True if the value is null
    :rtype: bool

    Usage::

        from pewtils import is_null

        >>> empty_list = []
        >>> is_null(empty_list, empty_lists_are_null=True)
        True
    """

    return not is_not_null(
        val, empty_lists_are_null=empty_lists_are_null, custom_nulls=custom_nulls
    )



def is_not_null(val, empty_lists_are_null=False, custom_nulls=None):

    """
    Checks whether the value is null, using a variety of potential string values, etc. The following values are always
    considered null: ``numpy.nan, None, "None", "nan", "", " ", "NaN", "none", "n/a", "NONE", "N/A"``

    :param val: The value to check
    :param empty_lists_are_null: Whether or not an empty list or :py:class:`pandas.DataFrame` should be considered \
    null (default=False)
    :type empty_lists_are_null: bool
    :param custom_nulls: an optional list of additional values to consider as null
    :type custom_nulls: list
    :return: True if the value is not null
    :rtype: bool

    Usage::

        from pewtils import is_not_null

        >>> text = "Hello"
        >>> is_not_null(text)
        True
    """

    null_values = [None, "None", "nan", "", " ", "NaN", "none", "n/a", "NONE", "N/A"]
    if custom_nulls:
        null_values.extend(custom_nulls)
    if type(val) == list:
        if empty_lists_are_null and val == []:
            return False
        else:
            return True
    elif isinstance(val, pd.Series) or isinstance(val, pd.DataFrame):
        if empty_lists_are_null and len(val) == 0:
            return False
        else:
            return True
    else:
        try:
            try:
                good = val not in null_values
                if good:
                    try:
                        try:
                            good = not pd.isnull(val)
                        except IndexError:
                            good = True
                    except AttributeError:
                        good = True
                return good
            except ValueError:
                return val.any()
        except TypeError:
            # isinstance() needs a type as its second argument; passing None raises TypeError
            return not isinstance(val, type(None))




def scale_range(old_val, old_min, old_max, new_min, new_max):

    """
    Scales a value from one range to another.  Useful for comparing values from different scales, for example.

    :param old_val: The value to convert
    :type old_val: int or float
    :param old_min: The minimum of the old range
    :type old_min: int or float
    :param old_max: The maximum of the old range
    :type old_max: int or float
    :param new_min: The minimum of the new range
    :type new_min: int or float
    :param new_max: The maximum of the new range
    :type new_max: int or float
    :return: Value equivalent from the new scale
    :rtype: float

    Usage::

        from pewtils import scale_range

        >>> old_value = 5
        >>> scale_range(old_value, 0, 10, 0, 20)
        10.0
    """

    return (
        ((float(old_val) - float(old_min)) * (float(new_max) - float(new_min)))
        / (float(old_max) - float(old_min))
    ) + float(new_min)

In [None]:

def compute_mutual_info(y, x, weights=None, col_names=None, l=0, normalize=True):

    """
    Computes pointwise mutual information for a set of observations partitioned into two groups.

    :param y: An array or, preferably, a :py:class:`pandas.Series`
    :param x: A matrix, :py:class:`pandas.DataFrame`, or preferably a :py:class:`scipy.sparse.csr_matrix`
    :param weights: (Optional) An array of weights corresponding to each observation
    :param col_names: The feature names associated with the columns in matrix 'x'
    :type col_names: list
    :param l: An optional Laplace smoothing parameter
    :type l: int or float
    :param normalize: Toggle normalization on or off (to control for feature prevalance), on by default
    :type normalize: bool
    :return: A :py:class:`pandas.DataFrame` of features with a variety of computed metrics including mutual information.

    The function expects ``y`` to correspond to a list or series of values indicating which partition an observation \
    belongs to. ``y`` must be a binary flag. ``x`` is a set of features (either a :py:class:`pandas.DataFrame` or \
    sparse matrix) where the rows correspond to observations and the columns represent the presence of features (you \
    can technically run this using non-binary features but the results will not be as readily interpretable.) The \
    function returns a :py:class:`pandas.DataFrame` of metrics computed for each feature, including the following \
    columns:

    - ``MI1``: The feature's mutual information for the positive class
    - ``MI0``: The feature's mutual information for the negative class
    - ``total``: The total number of times a feature appeared
    - ``total_pos_with_term``: The total number of times a feature appeared in positive cases
    - ``total_neg_with_term``: The total number of times a feature appeared in negative cases
    - ``total_pos_neg_with_term_diff``: The raw difference in the number of times a feature appeared in positive cases \
    relative to negative cases
    - ``pct_pos_with_term``: The proportion of positive cases that had the feature
    - ``pct_neg_with_term``: The proportion of negative cases that had the feature
    - ``pct_pos_neg_with_term_ratio``: A likelihood ratio indicating the degree to which a positive case was more likely \
    to have the feature than a negative case
    - ``pct_term_pos``: Of the cases that had a feature, the proportion that were in the positive class
    - ``pct_term_neg``: Of the cases that had a feature, the proportion that were in the negative class
    - ``pct_term_pos_neg_diff``: The percentage point difference between the proportion of cases with the feature that \
    were positive vs. negative
    - ``pct_term_pos_neg_ratio``: A likelihood ratio indicating the degree to which a feature was more likely to appear \
    in a positive case relative to a negative one (may not be meaningful when classes are imbalanced)

    .. note:: Note that ``pct_term_pos`` and ``pct_term_neg`` may not be directly comparable if classes are imbalanced, \
        and in such cases a ``pct_term_pos_neg_diff`` above zero or ``pct_term_pos_neg_ratio`` above 1 may not indicate a \
        true association with the positive class if positive cases outnumber negative ones.

    .. note:: Mutual information can be a difficult metric to explain to others. We've found that the \
        ``pct_pos_neg_with_term_ratio`` can serve as a more interpretable alternative method for identifying \
        meaningful differences between groups.

    Usage::

        from pewanalytics.stats.mutual_info import compute_mutual_info
        import nltk
        import pandas as pd
        from sklearn.metrics.pairwise import linear_kernel
        from sklearn.feature_extraction.text import TfidfVectorizer

        nltk.download("inaugural")
        df = pd.DataFrame([
            {"speech": fileid, "text": nltk.corpus.inaugural.raw(fileid)} for fileid in nltk.corpus.inaugural.fileids()
        ])
        df['year'] = df['speech'].map(lambda x: int(x.split("-")[0]))
        df['21st_century'] = df['year'].map(lambda x: 1 if x >= 2000 else 0)

        vec = TfidfVectorizer(min_df=10, max_df=.9).fit(df['text'])
        tfidf = vec.transform(df['text'])

        # Here are the terms most distinctive of inaugural addresses in the 21st century vs. years prior

        >>> results = compute_mutual_info(df['21st_century'], tfidf, col_names=vec.get_feature_names())

        >>> results.sort_values("MI1", ascending=False).index[:25]
        Index(['america', 'thank', 'bless', 'schools', 'ideals', 'americans',
               'meaning', 'you', 'move', 'across', 'courage', 'child', 'birth',
               'generation', 'families', 'build', 'hard', 'promise', 'choice', 'women',
               'guided', 'words', 'blood', 'dignity', 'because'],
              dtype='object')

    """

    if is_not_null(weights):
        weights = weights.fillna(0)
        y0 = sum(weights[y == 0])
        y1 = sum(weights[y == 1])
        total = sum(weights)
    else:
        y0 = len(y[y == 0])
        y1 = len(y[y == 1])
        total = y1 + y0

    if type(x).__name__ == "csr_matrix":

        if is_not_null(weights):
            x = x.transpose().multiply(csr_matrix(weights)).transpose()
        x1 = pd.Series(x.sum(axis=0).tolist()[0])
        x0 = total - x1
        x1y0 = pd.Series(
            x[np.ravel(np.array(y[y == 0].index)), :].sum(axis=0).tolist()[0]
        )
        x1y1 = pd.Series(
            x[np.ravel(np.array(y[y == 1].index)), :].sum(axis=0).tolist()[0]
        )

    else:

        if type(x).__name__ != "DataFrame":
            x = pd.DataFrame(x, columns=col_names)

        if is_not_null(weights):
            x = x.multiply(weights, axis="index")
            x1 = x.multiply(weights, axis="index").sum()
            x0 = ((x * -1) + 1).multiply(weights, axis="index").sum()
        else:
            x1 = x.sum()
            x0 = ((x * -1) + 1).sum()
        x1y0 = x[y == 0].sum()
        x1y1 = x[y == 1].sum()

    px1y0 = x1y0 / total
    px1y1 = x1y1 / total
    px0y0 = (y0 - x1y0) / total
    px0y1 = (y1 - x1y1) / total

    px1 = x1 / total
    px0 = x0 / total
    py1 = float(y1) / float(total)
    py0 = float(y0) / float(total)

    MI1 = (px1y1 / (px1 * py1) + l).map(lambda v: math.log(v, 2) if v > 0 else 0)
    if normalize:
        MI1 = MI1 / (-1 * px1y1.map(lambda v: math.log(v, 2) if v > 0 else 0))

    MI0 = (px1y0 / (px1 * py0) + l).map(lambda v: math.log(v, 2) if v > 0 else 0)
    if normalize:
        MI0 = MI0 / (-1 * px1y0.map(lambda v: math.log(v, 2) if v > 0 else 0))

    df = pd.DataFrame()

    df["MI1"] = MI1
    df["MI0"] = MI0

    df["total"] = x1
    df["total_pos_with_term"] = x1y1  # total_pos_mention
    df["total_neg_with_term"] = x1y0  # total_neg_mention
    df["total_pos_neg_with_term_diff"] = (
        df["total_pos_with_term"] - df["total_neg_with_term"]
    )
    df["pct_with_term"] = x1 / (x1 + x0)
    df["pct_pos_with_term"] = x1y1 / y1  # pct_pos_mention
    df["pct_neg_with_term"] = x1y0 / y0  # pct_neg_mention
    df["pct_pos_neg_with_term_diff"] = (
        df["pct_pos_with_term"] - df["pct_neg_with_term"]
    )  # pct_pos_neg_mention_diff
    df["pct_pos_neg_with_term_ratio"] = df["pct_pos_with_term"] / (
        df["pct_neg_with_term"]
    )  # pct_pos_neg_mention_ratio

    df["pct_term_pos"] = x1y1 / x1  # pct_mention_pos
    df["pct_term_neg"] = x1y0 / x1  # pct_mention_neg
    df["pct_term_pos_neg_diff"] = (
        df["pct_term_pos"] - df["pct_term_neg"]
    )  # pct_mention_pos_neg_diff
    df["pct_term_pos_neg_ratio"] = df["pct_term_pos"] / df["pct_term_neg"]

    if col_names is not None and len(col_names) > 0:
        df.index = col_names

    return df


def mutual_info_bar_plot(
    mutual_info,
    filter_col="MI1",
    top_n=50,
    x_col="pct_term_pos_neg_ratio",
    color="grey",
    title=None,
    width=10,
):
    """
    Takes a mutual information table generated by :py:func:`pewanalytics.stats.mutual_info.compute_mutual_info`, \
    and generates a bar plot of top features. Allows for an easy visualization of feature differences. Can \
    subsequently call :py:func:`plt.show` or :py:func:`plt.savefig` to display or save the plot.

    :param mutual_info: A mutual information table generated by \
    :py:func:`pewanalytics.stats.mutual_info.compute_mutual_info`
    :param filter_col: The column to use when selecting top features; sorts in descending order and picks the \
    top ``top_n``
    :type filter_col: str
    :param top_n: The number of features to display
    :type top_n: int
    :param x_col: The column by which to sort the final set of top features (after they have been selected by \
    ``filter_col``
    :type x_col: str
    :param color: The color of the bars
    :type color: str
    :param title: The title of the plot
    :type title: str
    :param width: The width of the plot
    :type width: int
    :return: A Matplotlib figure, which you can display via ``plt.show()`` or alternatively save to a file via \
    ``plt.savefig(FILEPATH)``
    """

    import seaborn
    import matplotlib.pyplot as plt

    mutual_info = mutual_info.sort_values(filter_col, ascending=False)[:top_n]
    mutual_info = mutual_info.sort_values(x_col, ascending=False)
    mutual_info["ngram"] = mutual_info.index
    buffer = 0.02 * abs(mutual_info[x_col].max() - mutual_info[x_col].min())
    plt.figure(figsize=(width, float(len(mutual_info) * 0.35)))
    seaborn.set_color_codes("pastel")  # noqa: F821
    g = seaborn.barplot(x=x_col, y="ngram", data=mutual_info, color=color)  # noqa: F821
    seaborn.despine(offset=10, trim=True)  # noqa: F821
    for i, row in enumerate(mutual_info.iterrows()):
        index, row = row
        g.text(
            x=row[x_col] + buffer,
            y=i,
            s=row["ngram"],
            horizontalalignment="left",
            verticalalignment="center",
            size="large",
            color=color,
        )
    g.set_title(title)

    return g


def mutual_info_scatter_plot(
    mutual_info,
    filter_col="MI0",
    top_n=25,
    x_col="pct_term_pos_neg_ratio",
    xlabel=None,
    scale_x_even=True,
    y_col="MI1",
    ylabel=None,
    scale_y_even=True,
    color="grey",
    color_col="MI0",
    size_col="pct_pos_with_term",
    title=None,
    figsize=(10, 10),
    adjust_text=False,
):
    """

    Takes a mutual information table generated by :py:func:`pewanalytics.stats.mutual_info.compute_mutual_info`, \
    and generates a scatter plot of top features.
    The names of the features will be displayed with varying colors and sizes depending on the variables specified
    in ``color_col`` and ``size_col``. Allows for an easy visualization of feature differences. Can subsequently
    call :py:func:`plt.show` or :py:func:`plt.savefig` to display or save the plot.

    :param mutual_info: A mutual information table generated by \
    :py:func:`pewanalytics.stats.mutual_info.compute_mutual_info`
    :param filter_col: The column to use when selecting top features; sorts in descending order and picks the top \
    ``top_n``
    :type filter_col: str
    :param top_n: The number of features to display
    :type top_n: int
    :param x_col: The column to use as the x-axis
    :type x_col: str
    :param xlabel: Label for the x-axis
    :type xlabel: str
    :param scale_x_even: If True, set values to their ordered rank (allows for even spacing)
    :type scale_x_even: bool
    :param y_col: The column to use as the y-axis
    :type y_col: str
    :param ylabel: Label for the y-axis
    :type ylabel: str
    :param scale_y_even: If True, set values to their ordered rank (allows for even spacing)
    :type scale_y_even: bool
    :param color: The color for the features
    :type color: str
    :param color_col: The column to use when shading the features
    :type color_col: str
    :param size_col: The column to use to size the features
    :type size_col: str
    :param title: The title of the plot
    :type title: str
    :param figsize: The size of the plot (tuple)
    :type figsize: tuple
    :param adjust_text: If True, attempts to adjusts the text so it doesn't overlap
    :type adjust_text: bool
    :return: A Matplotlib figure, which you can display via ``plt.show()`` or alternatively save to a file via \
    ``plt.savefig(FILEPATH)``
    """

    import seaborn
    import matplotlib.pyplot as plt

    mutual_info = mutual_info.sort_values(filter_col, ascending=False)[:top_n]

    if scale_x_even:
        mutual_info = mutual_info.sort_values(x_col)
        mutual_info["{}_rank".format(x_col)] = mutual_info.reset_index().index + 1
        x_col = "{}_rank".format(x_col)
    if scale_y_even:
        mutual_info = mutual_info.sort_values(y_col)
        mutual_info["{}_rank".format(y_col)] = mutual_info.reset_index().index + 1
        y_col = "{}_rank".format(y_col)

    color_maps = {
        "grey": plt.cm.Greys,
        "purple": plt.cm.Purples,
        "blue": plt.cm.Blues,
        "green": plt.cm.Greens,
        "orange": plt.cm.Oranges,
        "red": plt.cm.Reds,
    }

    _, ax = plt.subplots(figsize=figsize)
    seaborn.scatterplot(
        data=mutual_info, x=x_col, y=y_col, legend=False, alpha=0.0, ax=ax
    )  # noqa: F821
    seaborn.set_color_codes("pastel")  # noqa: F821

    mutual_info["size"] = mutual_info[size_col].map(
        lambda x: scale_range(
            x,
            mutual_info[size_col].min(),
            mutual_info[size_col].max(),
            (figsize[0] * 3),
            (figsize[0] * 5),
        )
    )
    mutual_info["color"] = mutual_info[color_col].map(
        lambda x: scale_range(
            x, mutual_info[color_col].min(), mutual_info[color_col].max(), 0.4, 1
        )
    )
    mutual_info["color"] = mutual_info["color"].map(color_maps[color])

    mutual_info["x"] = mutual_info[x_col]
    mutual_info["y"] = mutual_info[y_col]
    ax.set_title(title)
    ax.set_xlim((mutual_info["x"].min(), mutual_info["x"].max()))
    ax.set_ylim((mutual_info["y"].min(), mutual_info["y"].max()))
    ax.set_ylabel(ylabel)
    ax.set_xlabel(xlabel)

    texts = []
    for index, row in mutual_info.iterrows():
        texts.append(
            ax.text(
                row["x"],
                row["y"],
                index,
                size=row["size"],
                color=row["color"],
                horizontalalignment="right"
                if row["x"] > (mutual_info["x"].max() / 2.0)
                else "left",
                verticalalignment="top"
                if row["y"] > (mutual_info["y"].max() / 2.0)
                else "bottom",
            )
        )
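    if adjust_text:
        # adjustText is an optional third-party package, imported only when requested
        from adjustText import adjust_text as adjust_text_function

        adjust_text_function(
            texts, arrowprops=dict(arrowstyle="-", color="black", lw=0.5, alpha=0.5)
        )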

    seaborn.despine(offset=10, trim=True)  # noqa: F821

    return ax


In [None]:
# liwc_analysis = pd.read_csv('../Documents/Social Data Science - Masters/Thesis/Data/LIWC-22 Results - guardian_classified_clean - LIWC Analysis.csv')

In [None]:
liwc_analysis_men = pd.read_csv('/kaggle/input/liwc-analysis-men-csv/liwc_analysis_men.csv')


In [None]:
liwc_analysis_women = pd.read_csv('/kaggle/input/liwc-analysis-men-csv/liwc_analysis_women.csv')


In [None]:


# Collect the article texts for each group and split them on whitespace.
# pandas stores missing text as NaN rather than None, so use dropna() to skip it.
men_text = [str(text).split() for text in liwc_analysis_men['article_text'].dropna()]
women_text = [str(text).split() for text in liwc_analysis_women['article_text'].dropna()]

# Join the lists back into strings so they are ready for vectorization
men_text = [" ".join(text) for text in men_text]
women_text = [" ".join(text) for text in women_text]

# Combine both men_text and women_text for vectorization
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(men_text + women_text)

# Convert to dense format if necessary
X_dense = X.toarray()

# Define labels: 0 for men, 1 for women
y = np.array([0] * len(men_text) + [1] * len(women_text))

# Define column names for the features
col_names = vectorizer.get_feature_names_out()

# Compute mutual information
result_full = compute_mutual_info(y, X_dense, col_names=col_names)
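
A note on the feature matrix: the `compute_mutual_info` docstring expects columns that mark the presence of a feature, while `CountVectorizer()` returns occurrence counts by default, so `total` in the result counts occurrences rather than articles. The sketch below illustrates the difference on two made-up sentences. It uses plain whitespace tokenisation and a hypothetical helper `term_totals`, both assumptions for illustration; `CountVectorizer` tokenises differently. In scikit-learn the presence variant corresponds to `CountVectorizer(binary=True)`; whether that is preferable here is a judgment call.

```python
from collections import Counter


def term_totals(docs, binary=False):
    """Sum term counts over whitespace-tokenised documents.

    With binary=True each document contributes at most 1 per term
    (presence); otherwise every occurrence is counted.
    """
    totals = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        totals.update(set(tokens) if binary else tokens)
    return totals


# Made-up example documents, not taken from the dataset
docs = ["she scored and she celebrated", "she passed"]
occurrences = term_totals(docs)            # every occurrence counts
presence = term_totals(docs, binary=True)  # at most one per document

print(occurrences["she"], presence["she"])  # 3 2
```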



In [None]:
# save result full
result_full.to_csv('/content/drive/MyDrive/SDS/Thesis/Data/pmi_result_full.csv')

In [None]:
# Women's texts: top 50 terms by MI1 (label 1 = women)
result_full.sort_values("MI1", ascending=False).index[:50]

In [None]:
# Men's texts: top 50 terms by MI0 (label 0 = men)
result_full.sort_values("MI0", ascending=False).index[:50]

In [None]:
mutual_info_bar_plot(result_full)

In [None]:
mutual_info_scatter_plot(result_full)

In [None]:
# Create a version with stop words removed, for comparison

In [None]:

# Collect the article texts for each group and split them on whitespace.
# Uses liwc_analysis_men (subsample_men is not defined anywhere in this notebook).
men_text = [str(text).split() for text in liwc_analysis_men['article_text'].dropna()]
women_text = [str(text).split() for text in liwc_analysis_women['article_text'].dropna()]

# Join the lists back into strings so they are ready for vectorization
men_text = [" ".join(text) for text in men_text]
women_text = [" ".join(text) for text in women_text]

# Combine both men_text and women_text for vectorization
# Here we set stop_words='english' to remove common stop words
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(men_text + women_text)

# Convert to dense format if necessary
X_dense = X.toarray()

# Define labels: 0 for men, 1 for women
y = np.array([0] * len(men_text) + [1] * len(women_text))

# Define column names for the features
col_names = vectorizer.get_feature_names_out()

# Compute mutual information
result_full_stopwords = compute_mutual_info(y, X_dense, col_names=col_names)


In [None]:
# Women's texts: top 50 terms by MI1, stop words removed
result_full_stopwords.sort_values("MI1", ascending=False).index[:50]

In [None]:
# Men's texts: top 50 terms by MI0, stop words removed
result_full_stopwords.sort_values("MI0", ascending=False).index[:50]

In [None]:
mutual_info_scatter_plot(result_full_stopwords)

In [None]:
# Save the results computed with stop words removed
result_full_stopwords.to_csv('/kaggle/working/pmi_result_stopwords.csv')

In [None]:
# read in pmi results
pmi_results = pd.read_csv('/kaggle/input/pmi-scores/pmi_result_full.csv')

In [None]:
pmi_results.head()

Unnamed: 0.1,Unnamed: 0,MI1,MI0,total,total_pos_with_term,total_neg_with_term,total_pos_neg_with_term_diff,pct_with_term,pct_pos_with_term,pct_neg_with_term,pct_pos_neg_with_term_diff,pct_pos_neg_with_term_ratio,pct_term_pos,pct_term_neg,pct_term_pos_neg_diff,pct_term_pos_neg_ratio
0,0,-0.028921,0.002902,434,23,411,-388,0.007802,0.006228,0.007914,-0.001686,0.786971,0.052995,0.947005,-0.894009,0.055961
1,0,0.062718,-0.012728,12330,1050,11280,-10230,0.221655,0.284322,0.217199,0.067123,1.309039,0.085158,0.914842,-0.829684,0.093085
2,0,,0.00699,3,0,3,-3,5.4e-05,0.0,5.8e-05,-5.8e-05,0.0,0.0,1.0,-1.0,0.0
3,0,,0.006287,1,0,1,-1,1.8e-05,0.0,1.9e-05,-1.9e-05,0.0,0.0,1.0,-1.0,0.0
4,0,,0.006287,1,0,1,-1,1.8e-05,0.0,1.9e-05,-1.9e-05,0.0,0.0,1.0,-1.0,0.0


In [None]:
pmi_results.set_index('Unnamed: 0', inplace=True)

In [None]:
# Women's texts: top 500 terms by MI1
pmi_results.sort_values("MI1", ascending=False).index[:500]

Index(['her', 'she', 'women', 'kerr', 'matildas', 'hayes', 'wsl', 'mead',
       'bronze', 'wiegman',
       ...
       'sembrant', 'swaby', 'voss', 'wilms', 'woman', 'huitema', 'linari',
       'rosengard', 'kundananji', 'gwinn'],
      dtype='object', name='Unnamed: 0', length=500)

In [None]:
women_5000_pmi = pmi_results.sort_values("MI1", ascending=False).index[:5000].tolist()


In [None]:
men_5000_pmi = pmi_results.sort_values("MI0", ascending=False).index[:5000].tolist()


Next, I apply the BERT POS tagger to the top 5,000 PMI words for each group.
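
For orientation: the token-classification pipeline returns one list per input string, with one dict per word-piece (keys include `entity`, `score` and `word`). Because the uncased BERT tokenizer can split a rare word such as a surname into several pieces, a single word may come back with several tags. The sketch below uses a hand-written stand-in for the pipeline output and a hypothetical helper `summarise_tags`; the tags, scores and the particular split shown are illustrative assumptions, not real model output.

```python
def summarise_tags(words, outputs):
    """Pair each word with its first word-piece tag and its number of pieces."""
    rows = []
    for word, pieces in zip(words, outputs):
        first = pieces[0] if pieces else None
        rows.append({
            "word": word,
            "POS_Tag": first["entity"] if first else None,
            "n_pieces": len(pieces),
        })
    return rows


# Hand-written stand-in for classifier(["she", "matildas"]); values are illustrative only
words = ["she", "matildas"]
mock_outputs = [
    [{"entity": "PRON", "score": 0.99, "index": 1, "word": "she", "start": 0, "end": 3}],
    [
        {"entity": "PROPN", "score": 0.90, "index": 1, "word": "mat", "start": 0, "end": 3},
        {"entity": "PROPN", "score": 0.85, "index": 2, "word": "##ilda", "start": 3, "end": 7},
        {"entity": "PROPN", "score": 0.80, "index": 3, "word": "##s", "start": 7, "end": 8},
    ],
]

summary = summarise_tags(words, mock_outputs)
for row in summary:
    print(row)
```

Counting the pieces per word gives a quick check on how often taking only the first word-piece's tag actually matters.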

In [None]:
import torch

if torch.cuda.is_available():
    print(f"GPU is available: {torch.cuda.get_device_name(0)}")
else:
    print("GPU is not available.")

GPU is available: Tesla T4


In [None]:
from transformers import pipeline

classifier = pipeline("token-classification", model = "vblagoje/bert-english-uncased-finetuned-pos", device=0)



Some weights of the model checkpoint at vblagoje/bert-english-uncased-finetuned-pos were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [None]:
women_pmi_pos = classifier(women_5000_pmi)


In [None]:
men_pmi_pos = classifier(men_5000_pmi)

In [None]:
# Keep the tag of the first word-piece for each word. The word is stored in the
# same step so rows stay aligned even if the tagger returns nothing for a word.
first_entries = []
for word, tokens in zip(women_5000_pmi, women_pmi_pos):
    if tokens:  # Skip words for which the tagger returned no tokens
        first_entry = tokens[0]
        first_entries.append({
            'POS_Tag': first_entry['entity'],
            'Confidence_Score': first_entry['score'],
            'word': word
        })

women_pmi_pos_df = pd.DataFrame(first_entries)



In [None]:
# Keep the tag of the first word-piece for each word. The word is stored in the
# same step so rows stay aligned even if the tagger returns nothing for a word.
first_entries = []
for word, tokens in zip(men_5000_pmi, men_pmi_pos):
    if tokens:  # Skip words for which the tagger returned no tokens
        first_entry = tokens[0]
        first_entries.append({
            'POS_Tag': first_entry['entity'],
            'Confidence_Score': first_entry['score'],
            'word': word
        })

men_pmi_pos_df = pd.DataFrame(first_entries)

In [None]:
women_pmi_pos_df.to_csv('/kaggle/working/pmi_pos_women_all.csv')

In [None]:
men_pmi_pos_df.to_csv('/kaggle/working/pmi_pos_men_all.csv')