# Pre-Processing Pipeline

Main libraries: **sklearn.preprocessing**

#### Working with categorical data:

In [None]:
import pandas as pd

# Make data categorical with an explicit order (A < B < C)
Cat_data1 = pd.Categorical(data, categories=['A', 'B', 'C'], ordered=True)

# Use the Series.cat accessor (df['column'].cat.method_name) on categorical columns
df['column'].cat.categories  # lists all categories in the column

# The inplace= argument was removed from these methods in pandas 2.0, so assign the result back
df['column'] = df['column'].cat.reorder_categories(new_categories=['cat1', 'cat2', 'cat3'], ordered=True)

# set_categories: any value not listed here becomes NaN
df['column'] = df['column'].cat.set_categories(new_categories=['cat1', 'cat2', 'cat3'], ordered=True)

df['column'] = df['column'].cat.add_categories(new_categories=['cat4', 'cat5'])  # adds new categories
df['column'] = df['column'].cat.remove_categories(removals=['cat4'])  # removes categories; values equal to cat4 become NaN

# Integer code for each value, following the category order
# (alphabetical by default when categories are inferred); NaN is coded as -1
df['column'].cat.codes
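
To make the link between category order and codes concrete, here is a minimal sketch. The `sizes` values and the low < medium < high ordering are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data; the order low < medium < high is an assumption
sizes = pd.Series(["medium", "low", "high", "low"])
ordered = pd.Categorical(sizes, categories=["low", "medium", "high"], ordered=True)

# Each code is the position of the value in `categories`
codes = list(ordered.codes)

# max() respects the declared order, not alphabetical order
largest = ordered.max()

# A value missing from `categories` becomes NaN and gets code -1
partial = pd.Categorical(sizes, categories=["low", "medium"], ordered=True)
partial_codes = list(partial.codes)

print(codes, largest, partial_codes)
```

The codes depend on the order passed in `categories`, so the same data can produce different codes under a different ordering.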

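**Managing Categorical Variables**
- Always fit encoders AFTER splitting the data (fit on the training set, then transform the test set)
- Most sklearn estimators require numeric input, so categorical variables must be encoded first
- Dummy variables take only the values 1 or 0; a feature with many categories becomes multiple columns


1. sklearn - LabelEncoder() - assigns an integer to each class in a single column (intended for target labels; OrdinalEncoder() is the equivalent for features)
2. sklearn - OneHotEncoder() - one-hot encoding: one column per class, each 0 or 1
3. pandas - get_dummies() - one-hot encodes DataFrame columns directly
4. sklearn - DictVectorizer() - one-hot encodes the string values in a list of dicts in one step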
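
As a rough illustration of encoding after the split, the sketch below fits `OneHotEncoder` on a training frame only and then transforms a test frame containing an unseen category. The `color` column and its values are hypothetical. The default sparse output is converted with `.toarray()` so the snippet does not depend on the version-specific `sparse` / `sparse_output` argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frames that have already been split into train and test
X_train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
X_test = pd.DataFrame({"color": ["blue", "purple"]})  # 'purple' never seen in training

# handle_unknown="ignore" encodes unseen categories as an all-zero row
ohe = OneHotEncoder(handle_unknown="ignore")
train_encoded = ohe.fit_transform(X_train[["color"]]).toarray()
test_encoded = ohe.transform(X_test[["color"]]).toarray()

# Categories learned from the training data, in sorted order
learned = list(ohe.categories_[0])

print(learned)
print(test_encoded)
```

Fitting on the training frame alone means the test set cannot influence which columns exist, which is the point of encoding after splitting.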

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

le = LabelEncoder()
# scikit-learn >= 1.2 uses sparse_output; older versions used sparse=False
ohe = OneHotEncoder(sparse_output=False)

# LabelEncoder maps each class to an integer; fit on the training data only
X_train['categorical_column_le'] = le.fit_transform(X_train['categorical_column'])

# Makes one dummy (0/1) column per category
dummies = pd.get_dummies(df['column'], drop_first=True)
# The first level is dropped because p categories need only p-1 columns:
# a row where all dummy columns are zero represents the dropped category

from sklearn.feature_extraction import DictVectorizer
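
The `DictVectorizer` import above is not exercised in these notes, so here is a minimal sketch of how it might be used. The `records` list, with its `city` and `temp` keys, is hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: string values are one-hot encoded, numeric values pass through
records = [
    {"city": "Paris", "temp": 12.0},
    {"city": "Oslo", "temp": 3.0},
]

dv = DictVectorizer(sparse=False)  # return a dense NumPy array
encoded = dv.fit_transform(records)

# One "key=value" column per string category, plus the numeric column
feature_names = dv.feature_names_

print(feature_names)
print(encoded)
```

A DataFrame can be fed in the same way after `df.to_dict("records")`.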

**Imputation and Pipelines** aka **Transformers**
- Always SPLIT data first before imputing to reduce _data leakage_

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

imp_mean = SimpleImputer(strategy='mean')
model = Model()  # placeholder: substitute any sklearn estimator

steps = [("imputer", imp_mean),
         ("model", model)]

pipeline = Pipeline(steps)

# fit() learns the imputation means from X_train only, then trains the model
pipeline.fit(X_train, y_train)
pipeline.predict(X_test)  # ...

################### WHEN ONLY USING SIMPLEIMPUTER ################

X_train = imp_mean.fit_transform(X_train)
X_test = imp_mean.transform(X_test)
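
A small sketch of why the fit / transform split matters for imputation. The arrays are toy values made up for illustration. The missing test value is filled with the mean learned from the training data, so the extreme test value never influences it.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy arrays, for illustration only
X_train = np.array([[1.0], [3.0], [np.nan]])
X_test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # learns the training mean, ignoring NaN
X_test_imp = imputer.transform(X_test)        # reuses that mean, does not refit

train_mean = imputer.statistics_[0]

print(train_mean)
print(X_test_imp)
```

Had the imputer been fit on the combined data, the value 100.0 would have pulled the imputed mean upward, which is the leakage the note above warns about.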

##### Centering and Scaling
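- Standardizing: center around zero and scale to a variance of 1.
- Min-max scaling: subtract the minimum and divide by the range, so values lie between 0 and 1.
- Normalizing: rescale each sample (row) to unit norm, as sklearn's Normalizer does.

Always SPLIT data first before scaling to reduce data leakage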
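
The sketch below contrasts two of these rescalings on toy arrays (the numbers are arbitrary). It uses `MinMaxScaler`, which works column by column, and `Normalizer`, which works row by row. Reading the "Normalize" bullet as sklearn's `Normalizer` is an assumption about what these notes intend.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Toy feature matrix: 3 samples, 2 features
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])

# Column-wise: (x - min) / (max - min), so each feature spans [0, 1]
minmax = MinMaxScaler().fit_transform(X)

# Row-wise: divide each sample by its L2 norm, so each row has length 1
rows = np.array([[3.0, 4.0], [6.0, 8.0]])
unit = Normalizer(norm="l2").fit_transform(rows)

print(minmax)
print(unit)
```

`Normalizer` has no statistics to learn from the data, so it does not raise the train/test leakage concern that the column-wise scalers do.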

##### StandardScaler

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s
where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

- **fit(X[, y, sample_weight])** - Compute the mean and std to be used for later scaling.

- **fit_transform(X[, y])** - Fit to data, then transform it.

- Always fit_transform the training set and ONLY transform the test set. 
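
As a sanity check of the formula z = (x - u) / s, this sketch compares `StandardScaler.transform` with a manual computation using the fitted `mean_` and `scale_` attributes. The arrays are toy values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays, for illustration only
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler().fit(X_train)
u = scaler.mean_[0]   # mean of the training samples
s = scaler.scale_[0]  # population standard deviation (ddof=0) of the training samples

# Apply z = (x - u) / s by hand using the training statistics
manual = (X_test - u) / s
scaled = scaler.transform(X_test)

print(u, s)
print(scaled, manual)
```

The test value is scaled with the training mean and standard deviation, which is why the test set is only ever passed to `transform`.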



In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Split the data first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Can also be used in pipelines

# Build the steps
steps = [("scaler", StandardScaler()),
         ("logreg", LogisticRegression())]
pipeline = Pipeline(steps)

# Create the parameter space; pipeline parameters are named <step>__<parameter>
parameters = {"logreg__C": np.linspace(0.001, 1.0, num=20)}

# Instantiate the grid search object
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training data
cv.fit(X_train, y_train)
print(cv.best_score_, "\n", cv.best_params_)
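
For a version that runs end to end, here is a sketch on synthetic data from `make_classification` (an assumption, since these notes do not say what `X` and `y` are). It also shows the `<step>__<parameter>` naming that `GridSearchCV` expects when tuning a pipeline step, and scores the refit pipeline on the held-out test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("logreg", LogisticRegression())])

# "logreg__C" targets the C parameter of the step named "logreg"
param_grid = {"logreg__C": np.linspace(0.001, 1.0, num=5)}
cv = GridSearchCV(pipeline, param_grid=param_grid, cv=3)
cv.fit(X_train, y_train)

# The refit best pipeline scales X_test with training statistics before predicting
test_accuracy = cv.score(X_test, y_test)

print(cv.best_params_, test_accuracy)
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every cross-validation fold, so the validation folds do not leak into the scaling statistics.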


## Text Vectorization

- TfidfVectorizer from sklearn.feature_extraction.text weights each word by TF-IDF, so words that are frequent in one document but rare across the corpus score highest

In [None]:
##### STRINGS #########
import re

# Write a pattern to extract decimal numbers such as 3.5
def return_mileage(length):

    # Search the text for matches (raw string so \d and \. reach the regex engine unchanged)
    mile = re.search(r'\d+\.\d+', length)

    # If a value is returned, use group(0) to return the found value
    if mile is not None:
        return float(mile.group(0))
    return None  # no decimal number found

# Apply the function to the Length column and take a look at both columns
hiking["Length_num"] = hiking['Length'].apply(return_mileage)
print(hiking[["Length", "Length_num"]].head())

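######## TEXT ##############

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Split the raw documents first so the vocabulary and IDF weights come from training data only
docs_train, docs_test, y_train, y_test = train_test_split(documents, y, stratify=y, random_state=42)

tfid_vec = TfidfVectorizer()
X_train = tfid_vec.fit_transform(docs_train)
X_test = tfid_vec.transform(docs_test)
# The result is a sparse matrix, which many sklearn estimators accept directly;
# call .toarray() only for estimators that require dense input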
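
To see what the vectorizer produces, here is a minimal sketch on three made-up sentences. It inspects the output shape and the learned IDF weights through the fitted `vocabulary_` and `idf_` attributes.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, for illustration only
docs = ["the cat sat", "the dog sat", "the cat ran fast"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)  # sparse matrix of shape (n_documents, n_terms)

vocab = sorted(vec.vocabulary_)  # the learned terms, alphabetically

# "the" appears in every document, so it receives the lowest IDF weight
idf_the = vec.idf_[vec.vocabulary_["the"]]
idf_dog = vec.idf_[vec.vocabulary_["dog"]]

print(vocab)
print(tfidf.shape, idf_the, idf_dog)
```

Terms that occur in every document are down-weighted, while a term such as "dog" that appears in a single document gets a larger weight.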