# String Manipulation and Regex in Python

In Python, a **string** is an immutable sequence of characters enclosed in single quotes (' '), double quotes (" "), or triple quotes for multi-line text. Strings are the fundamental data type for representing text, and they come with many built-in methods for manipulation.

**Counting Occurrences**:  
You can count the number of occurrences of a specific character or substring within a string using the `count` method.
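
As a quick illustration of `count`, the sketch below uses a made-up sample string (`text` is a hypothetical name, not part of the exercises). Note that `count` only counts non-overlapping matches.

```python
# Hypothetical sample text, chosen only for illustration.
text = "banana bandana"

# Count a single character.
a_count = text.count("a")
print(a_count)  # 6

# Count a substring.
an_count = text.count("an")
print(an_count)  # 4

# Matches never overlap: "aaaa" contains "aa" twice, not three times.
overlap_count = "aaaa".count("aa")
print(overlap_count)  # 2
```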

**Sub-String Slicing**:  
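To extract a sub-string from a string, you can use slicing. Slicing uses square brackets in the form `s[start:stop:step]`, where `start` is inclusive, `stop` is exclusive, and every part is optional.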
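
A minimal sketch of common slicing forms, assuming a made-up string `s` (hypothetical, for illustration only):

```python
# Hypothetical sample string; indices start at 0.
s = "Hello, world"

first_five = s[:5]     # characters 0..4
from_seventh = s[7:]   # index 7 to the end
last_three = s[-3:]    # negative indices count from the end
every_other = s[::2]   # step of 2
reversed_s = s[::-1]   # negative step walks backwards

print(first_five)    # Hello
print(from_seventh)  # world
print(last_three)    # rld
print(every_other)   # Hlo ol
print(reversed_s)    # dlrow ,olleH
```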

**Joining Strings**:  
To join a list of strings into a single string, call the `join` method on the separator string and pass the list as its argument.
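
The sketch below shows `join` with made-up data (`parts` and `numbers` are hypothetical names). Every element must already be a string, so other types need converting first.

```python
# Hypothetical date components, for illustration only.
parts = ["2024", "01", "15"]

# The separator is the string that join is called on.
date = "-".join(parts)
print(date)  # 2024-01-15

# join raises TypeError on non-string items, so convert them explicitly.
numbers = [1, 2, 3]
csv_line = ",".join(str(n) for n in numbers)
print(csv_line)  # 1,2,3
```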
  

**COLAB LINK**: https://colab.research.google.com/github/samsung-ai-course/6-7-edition/blob/main/NLP/Computers%20dont%20read%20numbers/Exercise%20Notebook.ipynb

In [1]:
import re
import requests

## Exercises

Below are exercises to practice string manipulation, regex, and tokenization. Complete each task by replacing the `# Your code here` and `# Your regex pattern here` placeholders.

In [None]:
# Exercise 1: Join List into String
words = ["Hello", "world", "Python"]
# Use join method to create a sentence
sentence =  # Your code here
print(sentence)  # Expected output: 'Hello world Python'

In [None]:
# Exercise 2: Count Occurrences
y = "Never gonna let you down"
# Count the number of occurrences of 'a'
count_a =  # Your code here
print(count_a)  # Expected output: 1

In [None]:
# Exercise 3: Regex Substitution
x1 = "The pin code is 1234"
# Substitute the digits with 'X'
modified_x1 = re.sub( # Your regex pattern here, x1)
print(modified_x1)  # Expected output: 'The pin code is XXXX'

In [None]:
# Exercise 4: Regex Clean-up
x2 = "This@is~a#messy_%%%%%%%%%%%%%%%%%%%%%%%%%%%%string"
# Clean special characters and remove excess spaces
cleaned_x2 = re.sub( # Your regex pattern here, x2)
print(cleaned_x2)  # Expected output: 'This is a messy string'

In [None]:
# Exercise 5: Extracting Numbers
def extract_numbers(text):
    # Define a regex pattern that matches each run of digits (separators such as '-' and '.' split the runs)
    pattern = ' # Your regex pattern here
    # Use findall to extract numbers
    numbers = re.findall(pattern, text)
    return numbers
# Test the function
sample_text = 'Call me at 123-456-7890 or visit me at 134.56.789'
result = extract_numbers(sample_text)
print(result)  # Expected output: ['123', '456', '7890', '134', '56', '789']

# Clinical note

In [24]:
# Fetch the clinical note from the course repository
# repo_location = "https://raw.githubusercontent.com/samsung-ai-course/6-7-edition/refs/heads/main/NLP/Computers%20dont%20read%20numbers/data/"
# files_list = requests.get(repo_location+"directories.txt")
# files_list  = files_list.text.split("\n")
note = requests.get("https://raw.githubusercontent.com/samsung-ai-course/6-7-edition/refs/heads/main/NLP/Computers%20dont%20read%20numbers/data/10001884-DS-34.txt")
note = note.text

In [26]:
print(note)

 
Name:  ___             Unit No:   ___
 
Admission Date:  ___              Discharge Date:   ___
 
Date of Birth:  ___             Sex:   F
 
Service: MEDICINE
 
Allergies: 
IV Dye, Iodine Containing Contrast Media / Oxycodone / 
cilostazol / Varenicline
 
Attending: ___.
 
Chief Complaint:
Shortness of breath
 
Major Surgical or Invasive Procedure:
None

 
History of Present Illness:
Ms. ___ is a ___ yo woman with a PMH notable for COPD on 
home O2(hospitalized ___, multiple recent ED visits), Afib on 
apixaban, HTN, CAD, and HLD who presents with several days of 
worsening dyspnea.

Patient has had several ED visits for dyspnea and a recent 
hospitalization for a COPD exacerbation in ___. She has 
been on steroid therapy with several attempts to taper over the 
last several months. After her most recent ED visit on ___ 
she was on placed on 60 mg PO prednisone with a taper down by 10 
mg each day. Her SOB worsened with the taper and she was seen on 
___ by her PCP who started her on

In [None]:
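# Exercise 6: Build a function that extracts structured fields from the note
def finding_structure_data(text):
    info = {}
    # Replace each NotImplementedError with a regex string that has a capture
    # group around the value to extract (it is read back with group(1)).
    patterns = {
        "Name": NotImplementedError,
        "Unit No": NotImplementedError,
        "Admission Date": NotImplementedError,
        "Discharge Date": NotImplementedError,
        "Date of Birth": NotImplementedError,
        "Sex": NotImplementedError}
    for key, pattern in patterns.items():
        # re.DOTALL lets '.' match newlines as well
        match = re.search(pattern, text, re.DOTALL)
        # Fields with no match are stored as the string "None"
        info[key] = match.group(1).strip() if match else "None"
    return info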
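
As a hint for the shape of such patterns, here is a minimal sketch on a made-up header (the `header` text and its field names are hypothetical and are not taken from the clinical note). Each pattern anchors on the field label, skips whitespace, and captures the value in group 1.

```python
import re

# Hypothetical header text, invented for illustration only.
header = "Ticket ID:  4521        Status:   open\nOwner:  ___     Priority:   high"

patterns = {
    "Ticket ID": r"Ticket ID:\s*(\S+)",
    "Status": r"Status:\s*(\S+)",
    "Reviewer": r"Reviewer:\s*(\S+)",  # deliberately absent from the text
}

fields = {}
for key, pattern in patterns.items():
    match = re.search(pattern, header)
    # group(1) is the text inside the parentheses of the pattern.
    fields[key] = match.group(1).strip() if match else None

print(fields)  # {'Ticket ID': '4521', 'Status': 'open', 'Reviewer': None}
```

This sketch stores a real `None` for missing fields, whereas the exercise function stores the string `"None"`; either works as long as downstream code expects it.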

# Tokenization
Tokenization is the process of breaking a text into smaller units called tokens, such as words, punctuation marks, or phrases. Below are some exercises on tokenization.
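
To make the idea concrete, the sketch below compares three simple tokenization strategies using only the standard library. The sample sentence and variable names are made up for illustration; this is not the method the exercises ask for.

```python
import re

# Hypothetical sample sentence.
sample = "Don't panic, it's fine."

# 1. Split on whitespace: punctuation stays attached to words.
whitespace_split = sample.split()
print(whitespace_split)  # ["Don't", 'panic,', "it's", 'fine.']

# 2. Keep only runs of word characters: punctuation is dropped
#    and contractions are split at the apostrophe.
word_tokens = re.findall(r"\w+", sample)
print(word_tokens)  # ['Don', 't', 'panic', 'it', 's', 'fine']

# 3. Keep words and emit each punctuation mark as its own token.
word_and_punct = re.findall(r"\w+|[^\w\s]", sample)
print(word_and_punct)
```

Which strategy is appropriate depends on whether punctuation and contractions carry meaning for the downstream task.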

In [None]:
import nltk
from nltk.tokenize import RegexpTokenizer

text = "John's dog, Max, loves chasing after tennis balls in the park. It's his favorite activity!"
# Create a tokenizer
tokenizer = RegexpTokenizer(r'\w+')  # raw string avoids an invalid-escape warning for \w
# Tokenize the text
tokens = tokenizer.tokenize(text)
print(tokens)  # Expected output: list of tokens

In [None]:
# Exercise 7: Custom Tokenizer
text = "Hello there! How are you today?"
# Implement another method of tokenization
# Example: WhitespaceTokenizer
whitespace_tokens =  # Your code here
print(whitespace_tokens)  # Expected output: list of tokens based on whitespace

# Sklearn CountVectorizer
Finally, let's use scikit-learn's `CountVectorizer` to create a bag-of-words representation of a small sample corpus.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I wanted the pineapple from the competition.",
    "This pineapple was the ultimate prize.",
    "And the third team stole my pineapple dream.",
    "Did you see the first pineapple at the competition?"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Try the get_feature_names_out method to see the columns


# Try the toarray method to see the representation of the words
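
To see what a bag-of-words matrix contains, the following standard-library sketch builds one by hand for a tiny made-up corpus. It is an approximation under stated assumptions, not scikit-learn's implementation: `docs`, `tokenize`, and `matrix` are hypothetical names, and the tokenizer only imitates `CountVectorizer`'s default behavior of lowercasing and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

# Hypothetical two-document corpus.
docs = ["the cat sat", "the cat saw the dog"]

def tokenize(doc):
    # Lowercase, then keep tokens of at least two word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# Columns: the sorted vocabulary across all documents.
vocab = sorted({token for doc in docs for token in tokenize(doc)})

# Rows: one count vector per document, aligned with vocab.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocab])

print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is a document and each column a vocabulary word, so word order within a document is discarded and only counts remain.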

These exercises should help improve your skills in string manipulation, regex, and tokenization using Python!