# String Manipulation and Regex in Python

In Python, a **string** is an immutable sequence of characters enclosed in single quotes (' ') or double quotes (" "). Strings are the fundamental data type for text, and the `str` type provides built-in methods for counting, slicing, and joining, which are introduced below.

**Counting Occurrences**:  
You can count the non-overlapping occurrences of a specific character or substring within a string using the `count` method. The match is case-sensitive.
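
As a minimal sketch of `count`, the snippet below uses an illustrative string (`phrase` and the other variable names are assumptions, not part of the exercises) to show single-character counts, substring counts, and the optional start/end arguments.

```python
# Illustrative input (not taken from the exercises).
phrase = "banana bandana"

# Count a single character.
n_a = phrase.count("a")

# Count a substring; matches do not overlap, so "ana" is found
# once in "banana" and once in "bandana".
n_ana = phrase.count("ana")

# Optional start/end arguments restrict the search to phrase[0:6].
n_a_first_word = phrase.count("a", 0, 6)

# Matching is case-sensitive, so an uppercase "B" is not found.
n_upper_b = phrase.count("B")

print(n_a, n_ana, n_a_first_word, n_upper_b)  # 6 2 3 0
```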

**Sub-String Slicing**:  
To extract a substring, use slicing with square brackets: `s[start:stop]` returns the characters from index `start` up to, but not including, index `stop`.
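
The following sketch illustrates a few common slice forms on a made-up string (`text` and the result names are assumptions chosen for illustration).

```python
# Illustrative input (not taken from the exercises).
text = "Hello, world"

first_word = text[0:5]     # characters at indices 0..4
last_word = text[-5:]      # negative indices count from the end
every_other = text[::2]    # a third value sets the step
reversed_text = text[::-1] # a step of -1 walks the string backwards

print(first_word)     # Hello
print(last_word)      # world
print(every_other)    # Hlo ol
print(reversed_text)  # dlrow ,olleH
```

Omitting `start` or `stop` means "from the beginning" or "to the end", and an out-of-range slice returns a shorter (possibly empty) string rather than raising an error.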

**Joining Strings**:  
To join a list of string objects into a single string with a specified separator, you can use the `join` method. It is called on the separator, not on the list: `separator.join(list_of_strings)`.
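
A short sketch of `join` on illustrative data (the lists `parts` and `numbers` are assumptions, not exercise inputs). Note that every element must already be a string, so other types need converting first.

```python
# Illustrative inputs (not taken from the exercises).
parts = ["2024", "01", "15"]

# The separator string is the object the method is called on.
date = "-".join(parts)

# Non-string items must be converted before joining.
numbers = [1, 2, 3]
csv_row = ",".join(str(n) for n in numbers)

# An empty separator concatenates the items directly.
compact = "".join(parts)

print(date)     # 2024-01-15
print(csv_row)  # 1,2,3
print(compact)  # 20240115
```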


In [None]:
import re

## Exercises

Below are exercises to practice string manipulation, regex, and tokenization. Complete each task by replacing the `# Your code here` placeholder with your own code.

In [None]:
# Exercise 1: Join List into String
words = ["Hello", "world", "Python"]
# Use join method to create a sentence
sentence =  # Your code here
print(sentence)  # Expected output: 'Hello world Python'

In [None]:
# Exercise 2: Count Occurrences
y = "Never gonna let you down"
# Count the number of occurrences of 'a'
count_a =  # Your code here
print(count_a)  # Expected output: 1

In [None]:
# Exercise 3: Regex Substitution
x1 = "The pin code is 1234"
# Substitute each digit with 'X' using re.sub(pattern, replacement, string)
modified_x1 = re.sub(r'', 'X', x1)  # Replace r'' with your regex pattern
print(modified_x1)  # Expected output: 'The pin code is XXXX'

In [None]:
# Exercise 4: Regex Clean-up
x2 = "This@is~a#messy_%%%%%%%%%%%%%%%%%%%%%%%%%%%%string"
# Replace each run of special characters (including '_') with a single space
cleaned_x2 = re.sub(r'', ' ', x2)  # Replace r'' with your regex pattern
print(cleaned_x2)  # Expected output: 'This is a messy string'

In [None]:
# Exercise 5: Extracting Numbers
def extract_numbers(text):
    # Define a regex pattern to match runs of digits ('-' and '.' act as separators)
    pattern = r''  # Your regex pattern here
    # Use findall to extract numbers
    numbers = re.findall(pattern, text)
    return numbers
# Test the function
sample_text = 'Call me at 123-456-7890 or visit me at 134.56.789'
result = extract_numbers(sample_text)
print(result)  # Expected output: ['123', '456', '7890', '134', '56', '789']

# Tokenization
Tokenization is the process of breaking down a text into smaller units like words or phrases. Below are some exercises on tokenization.
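
Before turning to NLTK, here is a minimal standard-library sketch of how the choice of rule changes the tokens. The sentence and the three patterns are illustrative assumptions, not the method used in the exercises.

```python
import re

# Illustrative input (not taken from the exercises).
sentence = "Don't panic, it's fine."

# Splitting on whitespace keeps punctuation attached to words.
by_whitespace = sentence.split()

# Matching runs of word characters drops punctuation but also
# splits contractions at the apostrophe.
by_word_chars = re.findall(r"\w+", sentence)

# Allowing apostrophes inside a token keeps contractions together.
with_apostrophes = re.findall(r"[\w']+", sentence)

print(by_whitespace)     # ["Don't", 'panic,', "it's", 'fine.']
print(by_word_chars)     # ['Don', 't', 'panic', 'it', 's', 'fine']
print(with_apostrophes)  # ["Don't", 'panic', "it's", 'fine']
```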

In [None]:
import nltk
from nltk.tokenize import RegexpTokenizer

text = "John's dog, Max, loves chasing after tennis balls in the park. It's his favorite activity!"
# Create a tokenizer
tokenizer = RegexpTokenizer(r'\w+')  # raw string avoids an invalid-escape warning for \w
# Tokenize the text
tokens = tokenizer.tokenize(text)
print(tokens)  # Expected output: list of tokens

In [None]:
# Exercise 6: Custom Tokenizer
text = "Hello there! How are you today?"
# Implement another method of tokenization
# Example: WhitespaceTokenizer
whitespace_tokens =  # Your code here
print(whitespace_tokens)  # Expected output: list of tokens based on whitespace

# Sklearn CountVectorizer
Finally, let's use sklearn's CountVectorizer to create a bag-of-words representation of a small sample corpus.
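
To make the idea concrete, the sketch below builds a bag-of-words matrix by hand with the standard library. It only approximates CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the `docs` list and the `tokenize` helper are hypothetical names introduced for illustration.

```python
import re
from collections import Counter

# Illustrative corpus (not the one used in the cell below).
docs = ["The cat sat.", "The cat saw the dog."]

def tokenize(doc):
    """Hypothetical helper: lowercase, then keep tokens of 2+ word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

# One word-count mapping per document.
counts = [Counter(tokenize(doc)) for doc in docs]

# The vocabulary is every distinct token, sorted to fix the column order.
vocabulary = sorted(set().union(*counts))

# Each row is a document, each column a vocabulary word.
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Word order is discarded: each document is described only by how often each vocabulary word appears in it.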

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I wanted the pineapple from the competition.",
    "This pineapple was the ultimate prize.",
    "And the third team stole my pineapple dream.",
    "Did you see the first pineapple at the competition?"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# try get_feature_names_out method to see the columns


# try toarray method to see the representation of the words

These exercises should help improve your skills in string manipulation, regex, and tokenization using Python!