# Component 0: Data Acquisition & Quality Filtering (The Data Pipeline)

This initial notebook establishes the project's foundational data pipeline.

Objective: To acquire a dataset of high-signal, relevant language-learning apps and apply aggressive filtering to ensure maximum data quality before modeling.

**Key Steps & Deliverables:**

1. Targeted Scraping: Used the [google_play_scraper library](https://pypi.org/project/google-play-scraper/) with an expanded set of high-signal search queries (e.g., 'pronunciation', 'language tutor') to maximize the initial recall of relevant apps.

2. Quality Filtering (Post-Acquisition): Applied a domain-specific keyword filter to the full text (title + description) to ensure every app is explicitly focused on language learning, removing irrelevant noise from the broader 'Education' category.

3. Initial Feature Engineering: Created structured features essential for later modeling:

   - realInstalls and score (cleaned and converted to numeric types).
   - install_tier (a binned classification target with High/Medium/Low popularity levels).
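The keyword filter described in step 2 is not shown in the cells of this notebook, so the following is only a minimal sketch of how such a filter could look. The keyword list `LANGUAGE_KEYWORDS` and the helper `filter_language_apps` are hypothetical names chosen for illustration; the `title` and `description` columns are the fields returned by `google_play_scraper.app`.

```python
import re
import pandas as pd

# Hypothetical keyword list: the notebook does not show the one actually used.
LANGUAGE_KEYWORDS = [
    "language", "vocabulary", "grammar", "pronunciation",
    "learn english", "learn spanish", "learn french",
]

def filter_language_apps(df, keywords=LANGUAGE_KEYWORDS):
    """Keep rows whose title + description mention at least one keyword."""
    # Escape each keyword and anchor it at a word start, so that
    # "language" also matches "languages" but not "slanguage".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + ")",
        flags=re.IGNORECASE,
    )
    # Missing titles or descriptions are treated as empty strings.
    full_text = df["title"].fillna("") + " " + df["description"].fillna("")
    mask = full_text.map(lambda text: pattern.search(text) is not None).astype(bool)
    return df[mask].reset_index(drop=True)

# Tiny illustrative input (made-up rows, not scraped data)
sample = pd.DataFrame({
    "title": ["Learn Spanish Fast", "Math Quiz", None],
    "description": ["Vocabulary drills", "Practice algebra", "Improve your pronunciation"],
})
filtered = filter_language_apps(sample)
print(filtered["description"].tolist())
```

A regex built from escaped keywords keeps the filter easy to audit: widening or narrowing it only means editing the list.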

Final Output: A cleaned, unique dataset ready for the detailed Exploratory Data Analysis (EDA) in Notebook 01.
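The feature-engineering step listed under Key Steps (numeric cleaning plus the install_tier target) could be sketched as follows. The tier cut-offs and the helper name `add_features` are assumptions made for illustration only; the notebook does not state the thresholds actually used.

```python
import pandas as pd

def add_features(df):
    """Return a copy with numeric realInstalls/score and an install_tier column."""
    out = df.copy()
    # Coerce to numeric; values that cannot be parsed become NaN.
    out["realInstalls"] = pd.to_numeric(out["realInstalls"], errors="coerce")
    out["score"] = pd.to_numeric(out["score"], errors="coerce")

    # Hypothetical cut-offs: up to 100k installs = Low, up to 1M = Medium, above = High.
    bins = [-1, 100_000, 1_000_000, float("inf")]
    labels = ["Low", "Medium", "High"]
    out["install_tier"] = pd.cut(out["realInstalls"], bins=bins, labels=labels)
    return out

# Tiny illustrative input (made-up values, not scraped data)
sample = pd.DataFrame({
    "realInstalls": [5000, "250000", 50_000_000, None],
    "score": ["4.5", 3.9, None, "n/a"],
})
features = add_features(sample)
print(features[["realInstalls", "score", "install_tier"]])
```

Rows with a missing install count keep a missing install_tier, so they can be dropped or imputed explicitly before modeling rather than being assigned silently to a tier.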

In [None]:
import pandas as pd
from google_play_scraper import search, app
from datetime import datetime
import time
import numpy as np
import re

# --- 1. CONFIGURATION AND KEYWORD DEFINITIONS ---

# Expanded list of queries for higher recall during scraping
SEARCH_QUERIES = [
    'language learning', 'learn english', 'learn spanish', 'learn french',
    'learn german', 'learn japanese', 'learn chinese', 'learn korean',
    'learn italian', 'learn russian', 'learn portuguese', 'learn arabic',
    'vocabulary', 'language tutor', 'language practice',
    'pronunciation', 'language flashcards', 'language exchange',
    'foreign languages', 'bilingual'
]

# Apps not specifically for language learning but potentially useful
CATEGORY_QUERIES = {
    'LLM': ['ai chatbot', 'chatgpt', 'gemini'], 
    'NOTEBOOK': ['note taking', 'notebook'], 
    'DICTIONARY&TRANSLATION': ['dictionary', 'translation']
}

# --- 2. DATA ACQUISITION FUNCTION ---

def scrape_apps():
    """Scrapes app data using the google_play_scraper library."""
    app_data = []
    scraped_app_ids = set()  # Track scraped app IDs for faster lookup
    
    print(f"Starting scrape across {len(SEARCH_QUERIES)} main queries...")
    for query in SEARCH_QUERIES:
        print(f"  > Searching for: {query}")
        try:
            results = search(
                query,
                lang='en',
                country='us',
                n_hits=1000
            )
            for res in results:
                app_id = res['appId']
                if app_id in scraped_app_ids:
                    continue
                try:
                    details = app(app_id)
                    details['category'] = 'LANGUAGE_LEARNING'  # Tag main query apps
                    app_data.append(details)
                    scraped_app_ids.add(app_id)
                    time.sleep(0.5)
                except Exception as e:
                    # Report the failure instead of silently dropping the app
                    print(f"    Skipping {app_id}: {e}")
        except Exception as e:
            print(f"Error during search for {query}: {e}")
    
    # Scrape category-specific queries
    for category, queries in CATEGORY_QUERIES.items():
        print(f"  > Searching for category: {category}")
        for query in queries:
            print(f"    > Query: {query}")
            try:
                results = search(
                    query,
                    lang='en',
                    country='us',
                    n_hits=500
                )
                for res in results:
                    app_id = res['appId']
                    if app_id in scraped_app_ids:
                        continue
                    try:
                        details = app(app_id)
                        details['category'] = category
                        app_data.append(details)
                        scraped_app_ids.add(app_id)
                        time.sleep(0.5)
                    except Exception as e:
                        # Report the failure instead of silently dropping the app
                        print(f"      Skipping {app_id}: {e}")
            except Exception as e:
                print(f"Error during search for {query}: {e}")
    
    df = pd.DataFrame(app_data)
    df.drop_duplicates(subset=['appId'], inplace=True)
    print(f"Finished scraping. Total unique apps: {len(df)}.")
    return df


In [None]:
# 1. Scrape data
df = scrape_apps()
df.head()

Starting scrape across 20 main queries...
  > Searching for: language learning
  > Searching for: learn english
  > Searching for: learn spanish
  > Searching for: learn french
  > Searching for: learn german
  > Searching for: learn japanese
  > Searching for: learn chinese
  > Searching for: learn korean
  > Searching for: learn italian
  > Searching for: learn russian
  > Searching for: learn portuguese


In [None]:

# 3. Save to CSV
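# Timestamp the file name so repeated runs do not overwrite earlier scrapes
file_path = f"language_apps_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"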
df.to_csv(file_path, index=False)

print("-" * 50)
print(f"Final CLEANED DataFrame has {len(df)} unique apps.")
print(f"Saved to {file_path}")



--------------------------------------------------
Final CLEANED DataFrame has 521 unique apps.
Saved to language_apps_20251202_100340.csv
