# Lecture 2.1: Working with String Data Types (str accessor, methods) in Pandas

## Introduction
In this lecture, we look at string data in Pandas and at the `str` accessor and its associated methods. Pandas provides a broad set of vectorized string-manipulation tools that apply an operation to every element of a column at once. The practical examples use text data loaded through the scikit-learn library.

## The `str` Accessor
The `str` accessor in Pandas exposes a wide range of string methods on a Series (including a single DataFrame column) or an Index that holds string data. It is not available on a whole DataFrame. The accessor provides a convenient way to perform common string operations, such as:

- Extracting substrings
- Performing string transformations
- Checking string patterns
- Splitting and joining strings
- And much more

To use the accessor, append `.str` to a Series or to a single DataFrame column and then call the method you need, like this: `df['column_name'].str.method()`.
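As a small illustration of the dot notation, here is a minimal sketch on a made-up Series (the values are illustrative and not taken from the lecture's dataset). It also shows that a missing entry stays missing rather than raising an error.

```python
import pandas as pd

# Hypothetical toy data; the None entry shows how missing values behave
s = pd.Series(["Alpha", "beta", None, "Gamma Ray"])

upper = s.str.upper()    # element-wise uppercase; the missing entry stays missing
lengths = s.str.len()    # element-wise length; the missing entry stays missing

print(upper.tolist())
print(lengths.tolist())
```

Each method is applied element by element, so no explicit Python loop is needed.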

## Common `str` Methods
Here are some of the most commonly used `str` methods in Pandas:

1. **`len()`**: Returns the length of each string in the Series or DataFrame column.
2. **`lower()`** and **`upper()`**: Convert strings to lowercase or uppercase, respectively.
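3. **`replace()`**: Replaces occurrences of a pattern with a new string. Matching is on substrings; pass `regex=True` to treat the pattern as a regular expression (in pandas 2.0 and later the default is a literal match).
4. **`split()`**: Splits strings at the specified separator (whitespace by default) and returns a Series of lists.
5. **`cat()`**: Concatenates strings, either joining all elements of the Series into one string or combining them element-wise with another Series.
6. **`contains()`**: Checks whether each string contains the specified pattern (case-sensitive by default) and returns a boolean Series.
7. **`strip()`**, **`lstrip()`**, and **`rstrip()`**: Remove leading and/or trailing whitespace (or other specified characters).
8. **`slice()`** and **`extract()`**: Extract substrings by position or by regex capture group. **`slice_replace()`** replaces a positional slice with another string.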
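Several of the methods listed above can be tried on a small, made-up Series of names (the data is purely illustrative). The sketch below covers `strip()`, `cat()`, `slice_replace()`, and `extract()`.

```python
import pandas as pd

# Hypothetical names with stray whitespace
names = pd.Series(["  Ada Lovelace ", "Alan Turing", " Grace Hopper"])

clean = names.str.strip()                             # remove surrounding whitespace
joined = clean.str.cat(sep="; ")                      # join all elements into one string
initial = clean.str.slice_replace(start=1, repl=".")  # keep first character, replace the rest
surname = clean.str.extract(r"(\w+)$", expand=False)  # last word via a regex capture group

print(joined)
print(initial.tolist())
print(surname.tolist())
```

Note that `cat()` with only a separator returns a single Python string, while the other calls return a Series of the same length as the input.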

## Practical Examples
Let's work through some practical examples using text data loaded through the scikit-learn library. We'll use the 20 newsgroups dataset (fetched with `fetch_20newsgroups`), which contains a collection of newsgroup documents.

In [5]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Load the 20 newsgroup dataset
news = fetch_20newsgroups(subset='all')
data = pd.DataFrame({'text': news.data})

1. **Counting the number of characters in each text:**

In [6]:
data['char_count'] = data['text'].str.len()
print(data['char_count'].head())

0     902
1     963
2    3780
3    3096
4     910
Name: char_count, dtype: int64


2. **Converting all text to lowercase:**

In [7]:
data['text_lower'] = data['text'].str.lower()
print(data['text_lower'].head())

0    from: mamatha devineni ratnam <mr47+@andrew.cm...
1    from: mblawson@midway.ecn.uoknor.edu (matthew ...
2    from: hilmi-er@dsv.su.se (hilmi eren)\nsubject...
3    from: guyd@austin.ibm.com (guy dawson)\nsubjec...
4    from: alexander samuel mcdiarmid <am2o+@andrew...
Name: text_lower, dtype: object


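3. **Replacing all occurrences of the substring 'the' with 'a':** The match is on substrings, not whole words, so 'the' inside a longer word is replaced too (in row 1 of the output, 'Matthew' becomes 'Mataw').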

In [8]:
data['text_replaced'] = data['text'].str.replace('the', 'a')
print(data['text_replaced'].head())

0    From: Mamatha Devineni Ratnam <mr47+@andrew.cm...
1    From: mblawson@midway.ecn.uoknor.edu (Mataw B ...
2    From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject...
3    From: guyd@austin.ibm.com (Guy Dawson)\nSubjec...
4    From: Alexander Samuel McDiarmid <am2o+@andrew...
Name: text_replaced, dtype: object
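Because `str.replace` with a plain string matches substrings, words that merely contain 'the' are rewritten as well. If only the standalone word should change, one option is a regular expression with word boundaries. A minimal sketch on made-up sentences (not taken from the dataset):

```python
import pandas as pd

# Hypothetical sentences chosen to contain 'the' both alone and inside words
s = pd.Series(["the theory of the matter", "Matthew met them there"])

# Literal substring replacement: also rewrites 'the' inside longer words
substring = s.str.replace("the", "a", regex=False)

# Regex with word boundaries: only the standalone word 'the' is replaced
whole_word = s.str.replace(r"\bthe\b", "a", regex=True)

print(substring.tolist())
print(whole_word.tolist())
```

Passing `regex` explicitly avoids relying on the default, which differs between pandas versions.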


4. **Splitting the text into a list of words:**

In [9]:
data['text_split'] = data['text'].str.split()
print(data['text_split'].head())

0    [From:, Mamatha, Devineni, Ratnam, <mr47+@andr...
1    [From:, mblawson@midway.ecn.uoknor.edu, (Matth...
2    [From:, hilmi-er@dsv.su.se, (Hilmi, Eren), Sub...
3    [From:, guyd@austin.ibm.com, (Guy, Dawson), Su...
4    [From:, Alexander, Samuel, McDiarmid, <am2o+@a...
Name: text_split, dtype: object
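The lists produced by `split()` can be processed further with the same accessor. The sketch below, on made-up header-like strings, counts words per row and uses the `n` and `expand` parameters to split only once and spread the pieces into columns.

```python
import pandas as pd

# Hypothetical short strings standing in for message text
s = pd.Series(["From: Guy Dawson", "Subject: Re: hello world"])

tokens = s.str.split()                  # whitespace split -> Series of lists
word_count = tokens.str.len()           # .str.len() also works on lists
parts = s.str.split(n=1, expand=True)   # at most one split, one column per piece

print(word_count.tolist())
print(parts)
```

With `expand=True` the result is a DataFrame, which is often easier to work with than a column of lists.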


5. **Checking if each text contains the substring 'python':** The match is case-sensitive by default, so 'Python' does not count.

In [10]:
data['contains_python'] = data['text'].str.contains('python')
print(data['contains_python'].head())

0    False
1    False
2    False
3    False
4    False
Name: contains_python, dtype: bool
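Since `contains()` is case-sensitive by default, a search for 'python' misses 'Python'. The sketch below, on a made-up Series, compares the default with `case=False` and uses `na=False` so that missing values yield `False` instead of a missing result.

```python
import pandas as pd

# Hypothetical texts, including a missing value
s = pd.Series(["I like Python", "python rocks", "Perl only", None])

case_sensitive = s.str.contains("python", na=False)
case_insensitive = s.str.contains("python", case=False, na=False)

print(case_sensitive.tolist())
print(case_insensitive.tolist())
```

The resulting boolean Series can be used directly as a filter, for example `s[case_insensitive]`.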


6. **Extracting the first 10 characters of each text:**

In [12]:
data['text_extract'] = data['text'].str.slice(stop=10)
print(data['text_extract'].head())

0    From: Mama
1    From: mbla
2    From: hilm
3    From: guyd
4    From: Alex
Name: text_extract, dtype: object
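Positional slicing works when the wanted text sits at a fixed offset. When it does not, `extract()` with a regex capture group is one alternative. The sketch below uses two shortened strings modeled on the `From:` lines in the output above (the `Subject` parts are made up) and pulls out the sender's address. The pattern is a simple assumption and not a full email validator.

```python
import pandas as pd

# Shortened, partly hypothetical messages modeled on the dataset's header lines
s = pd.Series([
    "From: guyd@austin.ibm.com (Guy Dawson)\nSubject: test",
    "From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject: other",
])

prefix = s.str.slice(stop=10)   # first 10 characters
same = s.str[:10]               # indexing shorthand for the same slice

# Capture the first whitespace-free token containing '@' after 'From:'
sender = s.str.extract(r"From:\s*(\S+@\S+)", expand=False)

print(prefix.tolist())
print(sender.tolist())
```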
