> From the original csv use NLP to cluster values in the Other Identification column. Name those clusters, then replace the other interests column with a column per cluster name

Based on your instructions, I will perform the following steps:

1. Load the dataset.
2. Preprocess the 'Other Identification' column - filling NaN cells with empty strings and transforming all the text to lowercase.
3. Vectorize the text column ('Other Identification') using TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer.
4. Implement a KMeans clustering algorithm to group similar entries based on the TF-IDF matrix.
5. Determine appropriate cluster names based on the top terms within each cluster.
6. Replace the 'Other Identification' column with a binary column for each cluster.

Let's start with preprocessing of the 'Other Identification' column.

In [13]:
# Importing necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Load the data
df = pd.read_csv('cleaned.csv')

# Preprocessing: fill missing values, then lowercase the text.
# Assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about.
df['Other Identification'] = df['Other Identification'].fillna('').str.lower()

# Check the first few rows of the dataframe
df.head()

Unnamed: 0,id,city,region,country_name,latitude,longitude,Genealogical researcher,GLAM professional,Government employee,Legal researcher,...,Health,History,Indigenous Land Claims,Indigenous Studies,Information Science,Law,Linguistics,Literature,Political Science,Other Interest
0,118488584413,Ottawa,Ontario,CA,45.4112,-75.6981,1,0,0,0,...,0,0,0,0,0,0,0,0,0,
1,118488584313,Vancouver,British Columbia,CA,49.2497,-123.1193,0,0,0,0,...,0,1,1,0,0,0,0,0,0,
2,118488550699,Casas Adobes,Arizona,US,32.3234,-110.9951,1,0,0,0,...,0,0,0,0,0,0,0,0,0,chinese history in canada
3,118488550688,London,Ontario,CA,42.9834,-81.233,0,1,0,0,...,0,1,0,0,0,0,0,1,0,
4,118488541922,Hallowell,Maine,US,44.2859,-69.7909,0,0,0,0,...,0,1,0,1,0,0,0,1,0,


The dataframe, df, has been loaded successfully and preprocessing has been performed on the 'Other Identification' column - NaN values have been replaced with empty strings and all text has been transformed into lowercase. 

The next step is to transform the text data in the 'Other Identification' column into a format that can be used for analysis. For this, we will use the TF-IDF Vectorizer from the sklearn library. This vectorizer transforms the text data into a matrix of TF-IDF features.

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
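
To make the weighting concrete, here is a small hand-rolled sketch on a toy corpus. The three entries are made up for illustration and are not taken from the dataset. I am assuming the smoothed IDF variant `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which, as far as I know, is what `TfidfVectorizer` uses by default.

```python
import math
from collections import Counter

# Toy corpus (hypothetical entries, not taken from the dataset)
docs = ["local historian", "amateur historian", "retired professor"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(tokens)
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1
    raw = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf(tokenized[0])
# "historian" appears in two documents and "local" in only one,
# so "local" receives the larger weight in the first document.
print(weights)
```

The rarer term gets the higher weight, which is the behaviour the clustering step relies on to tell entries apart.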

Let's proceed with vectorization.

In [14]:
# Vectorize the 'Other Identification' text data using TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Other Identification'])

# Check the shape of the TF-IDF matrix
tfidf_matrix.shape

(1005, 147)

The TF-IDF vectorization of the 'Other Identification' column was successful. It results in a TF-IDF matrix with 1005 documents (entries in the 'Other Identification' column) and 147 unique terms (words) across all documents.

We can now proceed to the next stage, which is using KMeans clustering to group these entries into clusters. To decide the number of clusters, we will start with a reasonable number like five. Afterward, we can adjust this number based on the output and requirements. Let's now perform the KMeans clustering.

In [15]:
# Specify number of clusters
num_clusters = 5

# Perform KMeans clustering (cluster labels can differ between runs unless random_state is fixed)
km = KMeans(n_clusters=num_clusters)

# Fit the KMeans model onto this vectorized data
km.fit(tfidf_matrix)

# Get the cluster assignments for each data point
clusters = km.labels_.tolist()

# Check the first few cluster assignments
clusters[:10]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

KMeans ran successfully. The first ten entries in the dataframe were all assigned to the same cluster (cluster 0). This is not surprising if those rows were blank: every empty string maps to the same all-zero TF-IDF vector, so blank entries always end up together in one cluster. The cluster assignments give us a way to group similar 'Other Identification' entries together.
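
The choice of five clusters was only a starting guess. One possible way to compare candidate values is the silhouette score, where higher means better-separated clusters. The sketch below runs on a made-up list of entries rather than the real column, and the range of `k` and the `random_state` value are arbitrary assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Toy stand-in for the real column (hypothetical values)
texts = [
    "local historian", "amateur historian", "family historian",
    "retired professor", "retired teacher", "university professor",
    "freelance writer", "science writer", "travel writer",
]
X = TfidfVectorizer().fit_transform(texts)

scores = {}
for k in range(2, 6):
    # random_state fixes the initialisation so the run is repeatable
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    if len(set(labels)) > 1:  # the silhouette score needs at least two clusters
        scores[k] = silhouette_score(X, labels)

# Candidate k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On the real data the same loop could be run over `tfidf_matrix`, although with very short text entries the scores may be close together and should be read as a rough guide only.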

The next step would be to name these clusters meaningfully. We'll use the common words (features) in each cluster to name them. For example, if the top terms in a cluster are ['organized', 'professional', 'business'], then we might name that cluster 'Professional Business'. Naming the clusters would allow us to replace the 'Other Identification' column in a more interpretable way.

Let's identify these top terms for each cluster and then name the clusters accordingly. To do this, we'll list the N terms with the largest weights in each cluster's centroid in the TF-IDF space. The centroid is the mean TF-IDF vector of all entries in the cluster, so its largest components are the terms that best characterise the cluster's "averaged theme".

Let's proceed with this step.

In [16]:
# Identify top N words in each cluster
N = 3
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = tfidf_vectorizer.get_feature_names_out()

for i in range(num_clusters):
    # Join the N highest-weighted terms of this cluster's centroid
    top = ' '.join(terms[ind] for ind in order_centroids[i, :N])
    print(f"Cluster {i}: {top}")

Cluster 0: writer retired professor
Cluster 1: researcher historical society
Cluster 2: historian amateur local
Cluster 3: editor and historian
Cluster 4: independent researcher scholar


We have successfully identified the top 3 terms in each cluster. We can thus name the clusters as follows based on these terms:

- Cluster 0: 'Writer, Retired, Professor'
- Cluster 1: 'Researcher, Historical Society'
- Cluster 2: 'Historian, Amateur, Local'
- Cluster 3: 'Editor and Historian'
- Cluster 4: 'Independent Researcher, Scholar'

Next, we replace the 'Other Identification' column with one column per cluster name. If a row's 'Other Identification' falls within a given cluster, that cluster's column is set to 1 (representing True) for the row; otherwise it is set to 0 (representing False). Let's proceed with this.

In [17]:
# Add the cluster assignments to the dataframe
df['Cluster'] = clusters

# Given names for each cluster
names = ['Writer, Retired, Professor', 'Researcher, Historical Society', 'Historian, Amateur, Local', 'Editor and Historian', 'Independent Researcher, Scholar']

# Create one 0/1 indicator column per cluster (cluster ids 0 to 4)
for i in range(num_clusters):
    # Compare against the cluster id and cast True/False to 1/0 directly,
    # rather than converting afterwards by column position with iloc
    df[names[i]] = (df['Cluster'] == i).astype(int)
    
# Drop original 'Other Identification' and 'Cluster' columns as we have now got individual cluster columns
df.drop(columns=['Other Identification', 'Cluster'], inplace=True)

# Check the first few rows of the updated dataframe
df.head()

Unnamed: 0,id,city,region,country_name,latitude,longitude,Genealogical researcher,GLAM professional,Government employee,Legal researcher,...,Law,Linguistics,Literature,Political Science,Other Interest,"Writer, Retired, Professor","Researcher, Historical Society","Historian, Amateur, Local",Editor and Historian,"Independent Researcher, Scholar"
0,118488584413,Ottawa,Ontario,CA,45.4112,-75.6981,1,0,0,0,...,0,0,0,0,,1,0,0,0,0
1,118488584313,Vancouver,British Columbia,CA,49.2497,-123.1193,0,0,0,0,...,0,0,0,0,,1,0,0,0,0
2,118488550699,Casas Adobes,Arizona,US,32.3234,-110.9951,1,0,0,0,...,0,0,0,0,chinese history in canada,1,0,0,0,0
3,118488550688,London,Ontario,CA,42.9834,-81.233,0,1,0,0,...,0,0,1,0,,1,0,0,0,0
4,118488541922,Hallowell,Maine,US,44.2859,-69.7909,0,0,0,0,...,0,0,1,0,,1,0,0,0,0


The 'Other Identification' column has been replaced with one column per cluster name. Each row has a 1 in the column of the cluster it was assigned to and 0 in the others. For example, a 1 under 'Writer, Retired, Professor' means that the row's 'Other Identification' text was closest to the centroid whose top terms are [writer, retired, professor].
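
As an aside, the same label-to-column step can also be written with `pd.get_dummies`. This is a minimal sketch on made-up labels, with shortened illustrative names rather than the cluster names used above.

```python
import pandas as pd

# Hypothetical cluster labels for four rows and illustrative names
labels = pd.Series([0, 2, 1, 0])
names = {0: "Writer", 1: "Researcher", 2: "Historian"}

# One 0/1 column per cluster name; reindex keeps a column even for
# a cluster that happens to have no rows
indicators = (
    pd.get_dummies(labels.map(names))
    .reindex(columns=list(names.values()), fill_value=0)
    .astype(int)
)
print(indicators)
```

Each row ends up with exactly one 1, and the resulting frame could be joined back onto the original dataframe with `pd.concat([df, indicators], axis=1)`.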

In summary:

1. I loaded the 'cleaned.csv' dataset and preprocessed the 'Other Identification' column.
2. I vectorized the column's text data using a TF-IDF Vectorizer.
3. I ran a KMeans clustering algorithm to group similar entries based on the TF-IDF matrix.
4. I named each resulting cluster after its top three terms.
5. Finally, the 'Other Identification' column was replaced with binary columns for each cluster, thus accomplishing your goal.

Please let me know if there's anything more you need or if you want to make some changes.

> Do the same as the previous step but for the Other Interest column

In [18]:
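# Preprocessing: fill missing values, then lowercase the text.
# Assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about.
df['Other Interest'] = df['Other Interest'].fillna('').str.lower()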

# Vectorize the 'Other Interest' text data using TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['Other Interest'])

# Specify number of clusters
num_clusters = 5

# Perform KMeans clustering (cluster labels can differ between runs unless random_state is fixed)
km = KMeans(n_clusters=num_clusters)

# Fit the KMeans model onto this vectorized data
km.fit(tfidf_matrix)

# Get the cluster assignments for each data point
clusters = km.labels_.tolist()

# Identify top N words in each cluster
N = 3
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = tfidf_vectorizer.get_feature_names_out()

top_terms = {}
for i in range(num_clusters):
    top_terms[i] = [terms[ind] for ind in order_centroids[i, :N]]

top_terms

{0: ['anthropology', 'studies', 'archaeology'],
 1: ['history', 'local', 'railway'],
 2: ['theology', 'biblical', 'psychotherapy'],
 3: ['architecture', 'youth', 'garden'],
 4: ['religion', 'youth', 'equality']}

I have successfully performed preprocessing, vectorization, clustering, and naming of clusters for the 'Other Interest' column. The top three terms in each cluster are as follows:

- Cluster 0: ['Anthropology', 'Studies', 'Archaeology']
- Cluster 1: ['History', 'Local', 'Railway']
- Cluster 2: ['Theology', 'Biblical', 'Psychotherapy']
- Cluster 3: ['Architecture', 'Youth', 'Garden']
- Cluster 4: ['Religion', 'Youth', 'Equality']

Now, I'll replace the 'Other Interest' column with one column for each cluster, indicating the membership of each row in these clusters.

In [19]:
# Naming the clusters based on top terms
names = ['Anthropology, Studies, Archaeology', 'History, Local, Railway', 'Theology, Biblical, Psychotherapy', 'Architecture, Youth, Garden', 'Religion, Youth, Equality']

# Adding Cluster column to the dataframe
df['Cluster'] = clusters

# Create one 0/1 indicator column per cluster
for i in range(num_clusters):
    # Compare against the cluster id and cast True/False to 1/0 directly,
    # rather than converting afterwards by column position with iloc
    df[names[i]] = (df['Cluster'] == i).astype(int)

# Drop the original 'Other Interest' and 'Cluster' columns
df.drop(columns=['Other Interest', 'Cluster'], inplace=True)

# Display the first few rows of the dataframe
df.head()

Unnamed: 0,id,city,region,country_name,latitude,longitude,Genealogical researcher,GLAM professional,Government employee,Legal researcher,...,Information Science,Law,Linguistics,Literature,Political Science,"Anthropology, Studies, Archaeology","History, Local, Railway","Theology, Biblical, Psychotherapy","Architecture, Youth, Garden","Religion, Youth, Equality"
0,118488584413,Ottawa,Ontario,CA,45.4112,-75.6981,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,118488584313,Vancouver,British Columbia,CA,49.2497,-123.1193,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,118488550699,Casas Adobes,Arizona,US,32.3234,-110.9951,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,118488550688,London,Ontario,CA,42.9834,-81.233,0,1,0,0,...,0,0,0,1,0,1,0,0,0,0
4,118488541922,Hallowell,Maine,US,44.2859,-69.7909,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0


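The 'Other Interest' column has been removed and replaced with five new columns representing the clusters identified above. Each row is marked with a 1 in the column of its cluster and 0 elsewhere. Note that rows with a blank 'Other Interest' are assigned to a cluster as well: rows 0, 1, 3, and 4 above had no text and all landed in 'Anthropology, Studies, Archaeology', so a 1 in that column does not necessarily indicate that interest.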
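
If blank answers should not count as members of any cluster, one option is to fit the vectorizer and KMeans on the non-empty rows only. The sketch below shows the idea on a made-up column. The toy values, the two-cluster setting, the `-1` placeholder, and the `cluster_0`/`cluster_1` column names are all assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the real column, including blank answers
toy = pd.DataFrame({"Other Interest": [
    None, "local history", "railway history", None,
    "biblical theology", "theology", "", "family history",
]})

text = toy["Other Interest"].fillna("").str.lower().str.strip()
has_text = text != ""

# Fit TF-IDF and KMeans on the non-empty answers only
X = TfidfVectorizer().fit_transform(text[has_text])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# -1 marks rows that had nothing to cluster
toy["Cluster"] = -1
toy.loc[has_text, "Cluster"] = km.labels_

# Blank rows match no cluster id, so they get 0 in every column
for i in range(km.n_clusters):
    toy[f"cluster_{i}"] = (toy["Cluster"] == i).astype(int)
print(toy)
```

With this variant a row of all zeros means "no answer given", which keeps the blank rows from inflating whichever cluster they would otherwise fall into.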

To summarize, I carried out the following steps:
1. Filled missing values in the 'Other Interest' column of the already loaded dataframe.
2. Preprocessed this column by transforming all text to lowercase.
3. Converted the text data into a TF-IDF matrix using TfidfVectorizer.
4. Performed a KMeans clustering with 5 clusters on this matrix.
5. Identified the top three terms per cluster to name each one:
    - Cluster 0: 'Anthropology, Studies, Archaeology'
    - Cluster 1: 'History, Local, Railway'
    - Cluster 2: 'Theology, Biblical, Psychotherapy'
    - Cluster 3: 'Architecture, Youth, Garden'
    - Cluster 4: 'Religion, Youth, Equality'
6. Finally, replaced the original 'Other Interest' column in the dataframe with a column for each cluster category. The column for the cluster a row belongs to is marked as 1, and all others are marked as 0.

These new columns give a clearer sense of which broad categories the original 'Other Interest' text fell into. Feel free to ask if you need more information or further analyses.
