### Feature Engineering Best Practices: Handling Text Data
**Question**: Load a dataset with text data (e.g., SMS Spam Collection), perform text
preprocessing, and extract numerical features using TF-IDF.
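
To make the weights less of a black box, here is a minimal pure-Python sketch of the TF-IDF arithmetic. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which mirrors scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`). The names `STOP_WORDS`, `tokenize`, and `tfidf` are hypothetical helpers for illustration, and the stop-word set is a tiny stand-in for scikit-learn's much longer English list.

```python
import math
import re
from collections import Counter

# Hypothetical mini stop-word list for illustration only;
# scikit-learn's built-in English list is much longer.
STOP_WORDS = {"a", "in", "to", "you", "have", "your", "has"}


def tokenize(text):
    """Lowercase, keep word tokens of 2+ characters, drop stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: tf * idf[t] for t, tf in counts.items()}
        # Scale the row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows


messages = [
    "Congratulations! You have won a $1000 Walmart gift card.",
    "URGENT! Your mobile number has won £2000 in a prize draw.",
]
weights = tfidf(messages)
for row in weights:
    print({term: round(w, 3) for term, w in row.items()})
```

In this sketch, `won` appears in both messages, so its IDF is lower and it receives a smaller weight than terms that occur in only one message. Each row has unit length, so a message with fewer surviving terms gives each of them a larger weight.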

In [None]:
# write your code from here

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Small hand-written sample in the style of the SMS Spam Collection dataset
data = {
    'Message': [
        'Free entry in 2 a weekly competition to win FA Cup final tickets',
        'Hey, are we still meeting for lunch today?',
        'Congratulations! You have won a $1000 Walmart gift card.',
        'Can you call me back when you get a chance?',
        'URGENT! Your mobile number has won £2000 in a prize draw.'
    ],
    'Label': ['Spam', 'Ham', 'Spam', 'Ham', 'Spam']
}

# Create a DataFrame
df = pd.DataFrame(data)

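# Text preprocessing and TF-IDF feature extraction:
# TfidfVectorizer lowercases and tokenizes the text by default, and
# stop_words='english' also removes common English words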
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_features = tfidf_vectorizer.fit_transform(df['Message'])

# Convert TF-IDF features to a DataFrame for better visualization
tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

print(tfidf_df)

       1000      2000      card  chance  competition  congratulations  \
0  0.000000  0.000000  0.000000     0.0     0.333333         0.000000   
1  0.000000  0.000000  0.000000     0.0     0.000000         0.000000   
2  0.420669  0.000000  0.420669     0.0     0.000000         0.420669   
3  0.000000  0.000000  0.000000     1.0     0.000000         0.000000   
4  0.000000  0.387757  0.000000     0.0     0.000000         0.000000   

        cup      draw     entry        fa  ...    mobile    number     prize  \
0  0.333333  0.000000  0.333333  0.333333  ...  0.000000  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000   
2  0.000000  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000   
3  0.000000  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000   
4  0.000000  0.387757  0.000000  0.000000  ...  0.387757  0.387757  0.387757   

    tickets  today    urgent   walmart    weekly       win       won  
0  0.3333