In [2]:
import pandas as pd
from prefixspan import PrefixSpan
import ast
import warnings

In [3]:
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('data_final_fix.csv')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10181 entries, 0 to 10180
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   timestamp       10181 non-null  object 
 1   course_url      10181 non-null  object 
 2   title           10181 non-null  object 
 3   headline        10178 non-null  object 
 4   is_bestseller   10181 non-null  object 
 5   rating          10181 non-null  float64
 6   num_reviews     10181 non-null  object 
 7   num_students    10181 non-null  object 
 8   instructor      10181 non-null  object 
 9   language        10181 non-null  object 
 10  price           10181 non-null  float64
 11  discount        9831 non-null   object 
 12  related_topics  10181 non-null  object 
 13  sections        10181 non-null  float64
 14  lectures        10181 non-null  float64
 15  total_length    10181 non-null  object 
dtypes: float64(4), object(12)
memory usage: 1.2+ MB


In [7]:
def parse_topics(topics_str):
    """
    Parses the 'related_topics' string.
    It handles:
    1. Empty/NaN values.
    2. Strings that are lists "['a', 'b']"
    3. Strings that are simple comma-separated "a, b, c"
    """
    # Handle NaN/non-string values and empty strings or empty list-strings
    if not isinstance(topics_str, str) or topics_str.strip() in ("", "[]"):
        return []

    try:
        # First, try to evaluate it as a Python literal (e.g., "['a', 'b']")
        if topics_str.startswith('[') and topics_str.endswith(']'):
            parsed_list = ast.literal_eval(topics_str)
            # Ensure all items are strings and drop blank entries
            return [str(item).strip() for item in parsed_list if str(item).strip()]

        # If not a list-string, treat as comma-separated (e.g., "a, b, c")
        else:
            return [topic.strip() for topic in topics_str.split(',') if topic.strip()]

    except (ValueError, SyntaxError):
        # Fallback for malformed list-strings: split on commas instead
        return [topic.strip() for topic in topics_str.split(',') if topic.strip()]
    except Exception:
        # Catch-all for other errors
        return []

print("Helper function 'parse_topics' is defined.")

Helper function 'parse_topics' is defined.


In [8]:
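# Parse each course's raw topic string into an ordered list of topics
df['topic_sequence'] = df['related_topics'].apply(parse_topics)

# Drop courses with no topics; empty sequences carry no pattern information
df_sequences = df[df['topic_sequence'].map(len) > 0]

# PrefixSpan expects a list of sequences (here, a list of topic lists)
sequence_db = df_sequences['topic_sequence'].tolist()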

In [9]:
print(sequence_db[:5])

[['Machine Learning', 'Data Science', 'Development'], ['Data Science', 'Other IT & Software', 'IT & Software'], ['Data Science', 'Development'], ['Math', 'Data Science', 'Development'], ['Machine Learning', 'Data Science', 'Development']]


In [17]:
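min_support = 50  # defined for reference only; topk() below does not use it
ps = PrefixSpan(sequence_db)

# Keep the 100 patterns with the highest support (no support threshold is applied)
top_100_patterns = ps.topk(100)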

In [19]:
for (support, pattern) in top_100_patterns:
    if len(pattern) > 2:
        print(f"support: {support}, pattern: {pattern}")

support: 353, pattern: ['Machine Learning', 'Data Science', 'Development']
support: 297, pattern: ['Python', 'Programming Languages', 'Development']
support: 282, pattern: ['Java', 'Programming Languages', 'Development']
support: 215, pattern: ['React JS', 'Web Development', 'Development']
support: 192, pattern: ['JavaScript', 'Web Development', 'Development']
support: 120, pattern: ['Node.Js', 'Web Development', 'Development']
support: 119, pattern: ['Artificial Intelligence (AI)', 'Data Science', 'Development']
support: 119, pattern: ['Python', 'Data Science', 'Development']
support: 105, pattern: ['Microservices', 'Other IT & Software', 'IT & Software']
support: 104, pattern: ['Generative AI (GenAI)', 'Data Science', 'Development']
support: 104, pattern: ['Natural Language Processing (NLP)', 'Data Science', 'Development']
support: 101, pattern: ['Model Context Protocol (MCP)', 'Other IT & Software', 'IT & Software']
support: 83, pattern: ['Deep Learning', 'Data Science', 'Developmen

## Insight 1: "Specific-to-General" Topic Hierarchies
This is the most obvious and frequent pattern. It shows how Udemy's category system is structured. A course is tagged with a specific topic, which is then followed by its main parent category.

**Support: 1451, Pattern:** `['Web Development', 'Development']`
**Support: 1686, Pattern:** `['Other IT & Software', 'IT & Software']`
**Support: 1007, Pattern:** `['Programming Languages', 'Development']`
**Support: 639, Pattern:** `['Network & Security', 'IT & Software']`

**Meaning:**
This is a clear tagging convention. When a creator lists a course, they are likely prompted to choose a specific topic (Web Development) and then its general "bucket" (Development). The high support counts are consistent with this being a platform-wide rule, although support alone does not prove that the ordering is never reversed.
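
One way to check the direction of this convention is to compare the support of a pair with the support of the same pair reversed. The sketch below is a minimal pure-Python illustration, not the `prefixspan` implementation: `contains_subsequence` and `support` are hypothetical helper names, and `sample_db` holds only the five sequences printed earlier in the notebook, so the counts say nothing about the full dataset.

```python
def contains_subsequence(sequence, pattern):
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match,
    # so each later item must be found after the earlier ones.
    return all(item in remaining for item in pattern)


def support(sequence_db, pattern):
    """Count the sequences that contain `pattern` as a subsequence."""
    return sum(contains_subsequence(seq, pattern) for seq in sequence_db)


# The five sequences shown by print(sequence_db[:5]) in the notebook.
sample_db = [
    ['Machine Learning', 'Data Science', 'Development'],
    ['Data Science', 'Other IT & Software', 'IT & Software'],
    ['Data Science', 'Development'],
    ['Math', 'Data Science', 'Development'],
    ['Machine Learning', 'Data Science', 'Development'],
]

forward = support(sample_db, ['Data Science', 'Development'])
reverse = support(sample_db, ['Development', 'Data Science'])
print(f"specific -> general: {forward}, general -> specific: {reverse}")
```

Running the same comparison against the full `sequence_db` would show whether the reversed order ever occurs; a reversed support near zero would back up the "specific first, general last" reading.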

---

## Insight 2: Common Interdisciplinary Topics
These patterns show which distinct fields are most often "packaged" together in a single course. This reveals the most popular interdisciplinary skills.

**Support: 1683, Pattern:** `['Data Science', 'Development']`
**Support: 357, Pattern:** `['Machine Learning', 'Data Science']`
**Support: 742, Pattern:** `['Math', 'Teaching & Academics']`
**Support: 211, Pattern:** `['Business Analytics & Intelligence', 'Business']`

**Meaning:**
This shows how courses are marketed.
Courses are not just "Data Science"; they are "Data Science for Developers."
"Machine Learning" courses are most frequently positioned as a part of "Data Science."
"Math" courses on Udemy are very often aimed at "Teaching & Academics," not just pure mathematicians.

---

## Insight 3: Implied "Tech Stacks" & Course Structures
These 3-item patterns are the most valuable. They show a clear "Tool -> Skill -> Domain" logic, which is the closest you'll get to a "learning path" from a creator's perspective.

**Support: 215, Pattern:** `['React JS', 'Web Development', 'Development']`
**Support: 297, Pattern:** `['Python', 'Programming Languages', 'Development']`
**Support: 282, Pattern:** `['Java', 'Programming Languages', 'Development']`
**Support: 353, Pattern:** `['Machine Learning', 'Data Science', 'Development']`

**Meaning:**
This is a "Course DNA" formula.

**Formula 1:** (Tool -> Specific Domain -> General Domain)
React JS (Tool) → Web Development (Specific Domain) → Development (General Domain).
This is the standard "stack" for a React course.

**Formula 2:** (Tool -> Skill Category -> General Domain)
Python (Tool) → Programming Languages (Skill Category) → Development (General Domain).
This tells you that Python and Java courses are primarily framed as general Development tools, not just for one specific purpose.

**Formula 3:** (Sub-field -> Main Field -> General Domain)
Machine Learning (Sub-field) → Data Science (Main Field) → Development (General Domain).
This is the most common "path" for advanced AI/ML topics. It shows they are rooted in Data Science principles and applied in a Development context.
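
To see which tools share the same "path", the 3-item patterns can be grouped by their last two items. The sketch below copies the first eight 3-item patterns from the notebook output above; `group_by_path` is a hypothetical helper name, and the grouping is only one possible way to read these results.

```python
from collections import defaultdict

# (support, pattern) pairs copied from the notebook output above.
three_item_patterns = [
    (353, ['Machine Learning', 'Data Science', 'Development']),
    (297, ['Python', 'Programming Languages', 'Development']),
    (282, ['Java', 'Programming Languages', 'Development']),
    (215, ['React JS', 'Web Development', 'Development']),
    (192, ['JavaScript', 'Web Development', 'Development']),
    (120, ['Node.Js', 'Web Development', 'Development']),
    (119, ['Artificial Intelligence (AI)', 'Data Science', 'Development']),
    (119, ['Python', 'Data Science', 'Development']),
]


def group_by_path(patterns):
    """Group patterns by everything after the first item (the 'path')."""
    paths = defaultdict(list)
    for support, (leaf, *path) in patterns:
        paths[tuple(path)].append((leaf, support))
    return dict(paths)


paths = group_by_path(three_item_patterns)
for path, leaves in paths.items():
    print(" -> ".join(path), leaves)
```

In this subset, Python shows up under two paths (support 297 via Programming Languages and 119 via Data Science), which is a reminder that one tool can carry more than one "formula".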

---

## Insight 4: Categorizing Emerging Technologies
Looking at the lower-support patterns for new, "hot" topics tells you where they "fit" in the market.

**Support: 143, Pattern:** `['Artificial Intelligence (AI)', 'Development']`
**Support: 122, Pattern:** `['Artificial Intelligence (AI)', 'Data Science']`
**Support: 119, Pattern:** `['Artificial Intelligence (AI)', 'Data Science', 'Development']`
**Support: 116, Pattern:** `['Generative AI (GenAI)', 'Development']`
**Support: 104, Pattern:** `['Generative AI (GenAI)', 'Data Science']`
**Support: 87, Pattern:** `['Deep Learning', 'Data Science']`
**Support: 86, Pattern:** `['Deep Learning', 'Development']`

**Meaning:**
There is a clear consensus: new AI topics are being categorized as belonging to both Data Science (the theory/field) and Development (the application).
GenAI and Deep Learning aren't their own categories yet; they are specializations within these two main buckets. This is a very useful insight into market positioning.
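
The overlap between the two buckets can be made concrete with the supports already listed. Dividing the support of `[topic, 'Data Science', 'Development']` by the support of `[topic, 'Development']` gives the share of a topic's Development-tagged courses that also list Data Science in between. The sketch below is plain arithmetic on numbers from this notebook; `share_via` is a hypothetical helper name.

```python
# Supports copied from the patterns listed in this notebook.
supports = {
    ('Artificial Intelligence (AI)', 'Development'): 143,
    ('Artificial Intelligence (AI)', 'Data Science', 'Development'): 119,
    ('Generative AI (GenAI)', 'Development'): 116,
    ('Generative AI (GenAI)', 'Data Science', 'Development'): 104,
}


def share_via(topic, middle, root):
    """Share of topic -> root sequences that also pass through `middle`."""
    return supports[(topic, middle, root)] / supports[(topic, root)]


ai_share = share_via('Artificial Intelligence (AI)', 'Data Science', 'Development')
genai_share = share_via('Generative AI (GenAI)', 'Data Science', 'Development')
print(f"AI via Data Science:    {ai_share:.0%}")
print(f"GenAI via Data Science: {genai_share:.0%}")
```

Both shares are well above one half (119/143 and 104/116), which fits the reading that these topics are mostly filed under Data Science within Development rather than directly under Development.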

