# Case Study: School Budgeting with Machine Learning in Python

**DrivenData** runs online data science challenges for non-profits, NGOs, and social enterprises.
* Goals of the case study:
    * **Use data to have a social impact**
    * **Build a machine learning algorithm that can automate the school budgeting process**
    
* Introduction to the challenge:
    * NLP
    * Feature Engineering
    * Efficiency-boosting hashing tricks
    * Supervised learning problem (labeled)
    * Over 100 target variables
    * Classification problem: we want to predict a category for each line item
    * Predictions will be probabilities for each label

* **Human-in-the-loop (HITL) machine learning** is a branch of artificial intelligence that leverages both human and machine intelligence to create machine learning models.

#### Exploring the Data:
* Encode labels as categories
* ML algorithms work on numbers, not strings
    * Need a numeric representation of these strings
* Strings can be slow compared to numbers
    * A string's length is not known ahead of time, so the computer takes longer to process strings than numbers, which have a fixed number of bits
* In pandas, `category` dtype encodes categorical data numerically
    * Can help speed up code
    * Use `.astype('category')`
    * To see a numerical representation of the categories, use:
        * `dummies = pd.get_dummies(sample_df[['label']], prefix_sep=' ')`
        * Dummy encoding is also called a **`binary indicator` representation**
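
The encoding steps above can be sketched on a toy DataFrame. The data here is made up for illustration and is not the budget dataset; `sample_df` and its `label` column simply mirror the names used in these notes.

```python
import pandas as pd

# Hypothetical toy data standing in for the labeled line items
sample_df = pd.DataFrame({
    'label': ['a', 'b', 'a', 'c'],
    'numeric': [1, 2, 3, 4],
})

# Convert the string column to the pandas category dtype
sample_df['label'] = sample_df['label'].astype('category')

# Each category is stored internally as a small integer code
codes = sample_df['label'].cat.codes
print(list(codes))

# Binary indicator (dummy) columns: one column per category
dummies = pd.get_dummies(sample_df[['label']], prefix_sep=' ')
print(dummies)
```

Each row of `dummies` has exactly one active column, which is why this is called a binary indicator representation.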
        
#### Lambda functions
* Alternative to `def` syntax
* Easy way to make simple, one line functions
* `square = lambda x: x*x`
    * $\Uparrow$ This lambda function takes one parameter, *x*, multiplies *x* by *x*, and returns the result.
    * Call this function, just like any other function:
        * `square(2)`
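
As a quick sketch of the point above, here is the lambda next to an equivalent function written with `def`. The name `square_def` is only for this comparison and does not come from the notes.

```python
# One-line function defined with lambda
square = lambda x: x * x

# Equivalent function written with def syntax
def square_def(x):
    return x * x

print(square(2))      # 4
print(square_def(2))  # 4
```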
        
* In the budget data, there are multiple columns that need to be made categorical
* To make multiple columns into categories, we need to apply the function to each column separately:
    * `categorize_label = lambda x: x.astype('category')`
    * `sample_df.label = sample_df[['label']].apply(categorize_label, axis=0)`
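
The same pattern extends to several columns at once. The sketch below assumes a small made-up DataFrame; the column names in `LABELS` and the values are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical data: two label columns and one numeric column
df = pd.DataFrame({
    'Function': ['Teacher', 'Admin', 'Teacher'],
    'Use': ['Instruction', 'Operations', 'Instruction'],
    'Total': [100.0, 250.5, 75.25],
})

# Columns that should become categorical (assumed for this example)
LABELS = ['Function', 'Use']

categorize_label = lambda x: x.astype('category')

# axis=0 passes each column to the lambda, one column at a time
df[LABELS] = df[LABELS].apply(categorize_label, axis=0)

print(df.dtypes)
```

For a plain dtype conversion like this, `df[LABELS].astype('category')` should give the same result without a lambda; the `apply` form is shown because it matches the approach in these notes.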