# String Manipulation and Regex in Python

In Python, a **string** is an immutable sequence of characters written between single quotes (' '), double quotes (" "), or triple quotes for multi-line text. Strings are the fundamental data type for text, and they come with built-in methods for searching, slicing, and joining.

**Counting Occurrences**:  
You can count the number of occurrences of a specific character or substring within a string using the `count` method.

**Sub-String Slicing**:  
To extract a sub-string from a string, you can use slicing. Slicing uses square brackets with `start:stop` indices, for example `s[0:5]`; the start index is included and the stop index is excluded.

**Joining Strings**:  
To join a list of string objects into a single string with a specified separator, you can use the `join` method.
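
The three operations described above can be sketched in a few lines. The sample string and variable names here are illustrative assumptions, not part of the exercises.

```python
text = "banana bread"  # illustrative sample string

# count: number of non-overlapping occurrences of a substring
count_a = text.count("a")
count_an = text.count("an")
print(count_a, count_an)  # 4 2

# slicing: text[start:stop] includes start and excludes stop
first_word = text[0:6]
last_word = text[-5:]      # negative indices count from the end
reversed_text = text[::-1]  # a step of -1 walks the string backwards
print(first_word, last_word, reversed_text)

# join: the separator is the string, the list of strings is the argument
date = "-".join(["2024", "01", "31"])
print(date)  # 2024-01-31
```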
  

**COLAB LINK**: https://colab.research.google.com/github/samsung-ai-course/6-7-edition/blob/main/NLP/Computers%20dont%20read%20numbers/Exercise%20Notebook.ipynb

In [8]:
!pip install nltk
!pip install requests
!pip install regex



In [9]:
import re
import requests

## Exercises

Below are exercises to practice string manipulation, regex, and tokenization. Complete the tasks using the provided functions.

In [12]:
# Exercise 1: Join List into String
words = ["Hello", "world", "Python"]
# Use join method to create a sentence
sentence = " ".join(words)
print(sentence)  # Expected output: 'Hello world Python'

Hello world Python


In [13]:
# Exercise 2: Count Occurrences
y = "Never gonna let you down"
# Count the number of occurrences of 'a'
count_a =  y.count("a")
print(count_a)  # Expected output: 1

1


In [16]:
# Exercise 3: Regex Substitution
x1 = "The pin code is 1234"
# Substitute the digits with 'X'
modified_x1 = re.sub(r'[0-9]', 'X', x1)
print(modified_x1)  # Expected output: 'The pin code is XXXX'

The pin code is XXXX


In [18]:
# Exercise 4: Regex Clean-up
x2 = "This@is~a#messy_%%%%%%%%%%%%%%%%%%%%%%%%%%%%string"
# Clean special characters and remove excess spaces
cleaned_x2 = re.sub(r'[^a-zA-Z0-9]', ' ', x2)
cleaned_x2 = re.sub(r'\s+', ' ', cleaned_x2).strip()
print(cleaned_x2)  # Expected output: 'This is a messy string'

This is a messy string


In [20]:
# Exercise 5: Extracting Numbers
def extract_numbers(text):
    # Define a regex pattern to match runs of digits (a decimal such as 134.56 is split at the dot)
    pattern = r'\d+'
    # Use findall to extract numbers
    numbers = re.findall(pattern, text)
    return numbers
# Test the function
sample_text = 'Call me at 123-456-7890 or visit me at 134.56.789'
result = extract_numbers(sample_text)
print(result)  # Expected output: ['123', '456', '7890', '134', '56', '789']

['123', '456', '7890', '134', '56', '789']
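
If decimals should stay in one piece, the pattern can allow an optional fractional part. The following is a minimal sketch; the helper name `extract_decimal_numbers` and the sample sentence are assumptions made for illustration.

```python
import re

def extract_decimal_numbers(text):
    # Hypothetical helper: digits, optionally followed by a dot and more digits
    pattern = r'\d+(?:\.\d+)?'
    return re.findall(pattern, text)

# Illustrative sample text, not taken from the exercise data
dosage_text = "Prednisone 2.5 mg twice daily, then 60 mg for 10 days"
decimal_numbers = extract_decimal_numbers(dosage_text)
print(decimal_numbers)  # ['2.5', '60', '10']
```

Applied to the sample text from Exercise 5, this pattern would read `134.56.789` as `134.56` followed by `789`, so the choice of pattern depends on what counts as a number in the data.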


# Clinical note

In [30]:
#Get clinical note
# repo_location = "https://raw.githubusercontent.com/samsung-ai-course/6-7-edition/refs/heads/main/NLP/Computers%20dont%20read%20numbers/data/"
# files_list = requests.get(repo_location+"directories.txt")
# files_list  = files_list.text.split("\n")
note = requests.get("https://raw.githubusercontent.com/samsung-ai-course/6-7-edition/refs/heads/main/NLP/Computers%20dont%20read%20numbers/data/10001884-DS-34.txt")
note = note.text

In [31]:
print(note)

 
Name:  ___             Unit No:   ___
 
Admission Date:  ___              Discharge Date:   ___
 
Date of Birth:  ___             Sex:   F
 
Service: MEDICINE
 
Allergies: 
IV Dye, Iodine Containing Contrast Media / Oxycodone / 
cilostazol / Varenicline
 
Attending: ___.
 
Chief Complaint:
Shortness of breath
 
Major Surgical or Invasive Procedure:
None

 
History of Present Illness:
Ms. ___ is a ___ yo woman with a PMH notable for COPD on 
home O2(hospitalized ___, multiple recent ED visits), Afib on 
apixaban, HTN, CAD, and HLD who presents with several days of 
worsening dyspnea.

Patient has had several ED visits for dyspnea and a recent 
hospitalization for a COPD exacerbation in ___. She has 
been on steroid therapy with several attempts to taper over the 
last several months. After her most recent ED visit on ___ 
she was on placed on 60 mg PO prednisone with a taper down by 10 
mg each day. Her SOB worsened with the taper and she was seen on 
___ by her PCP who started her on

In [33]:
# Exercise 6: Build a function that extracts structured data from the note
def finding_structure_data(text):
    info = {}
    patterns = {
        "Name": r"Name:\s+(.*?)(?:Unit No:|$)",
        "Unit No": r"Unit No:\s+(.*?)(?:Admission Date:|$)",
        "Admission Date": r"Admission Date:\s+(.*?)(?:Discharge Date:|$)",
        "Discharge Date": r"Discharge Date:\s+(.*?)(?:Date of Birth:|$)",
        "Date of Birth": r"Date of Birth:\s+(.*?)(?:Sex:|$)",
        "Sex": r"Sex:\s+(.*?)(?:Service:|$)"}
    for key, pattern in patterns.items():
        match = re.search(pattern, text, re.DOTALL)
        info[key] = match.group(1).strip() if match else "None"
    return info

sc = finding_structure_data(note)
print(sc)

{'Name': '___', 'Unit No': '___', 'Admission Date': '___', 'Discharge Date': '___', 'Date of Birth': '___', 'Sex': 'F'}
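
The same idea extends to multi-line sections of the note. The sketch below uses a short excerpt copied from the note printed above so that it runs without a network call; the helper `extract_section` and its parameters are hypothetical names, and it assumes each section ends where the next header begins.

```python
import re

# Short excerpt in the same layout as the note printed above
sample_note = """Service: MEDICINE

Allergies:
IV Dye, Iodine Containing Contrast Media / Oxycodone /
cilostazol / Varenicline

Attending: ___.

Chief Complaint:
Shortness of breath
"""

def extract_section(text, header, next_header):
    # Capture everything between "header:" and "next_header:" (or end of text)
    pattern = (re.escape(header) + r":\s*(.*?)\s*(?="
               + re.escape(next_header) + r":|\Z)")
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces into single spaces
    return " ".join(match.group(1).split())

allergies_text = extract_section(sample_note, "Allergies", "Attending")
allergies = [item.strip() for item in allergies_text.split("/")]
complaint = extract_section(sample_note, "Chief Complaint",
                            "Major Surgical or Invasive Procedure")
print(allergies)
print(complaint)
```

Returning `None` for a missing section, rather than the string `"None"`, lets callers tell a missing field apart from a field whose text is literally "None".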


# Tokenization
Tokenization is the process of breaking down a text into smaller units like words or phrases. Below are some exercises on tokenization.

In [39]:
import nltk
from nltk.tokenize import RegexpTokenizer

text = "John's dog, Max, loves chasing after tennis balls in the park. It's his favorite activity!"
# Create a tokenizer
tokenizer = RegexpTokenizer(r'\w+')  # raw string avoids an invalid-escape warning for \w
# Tokenize the text
tokens = tokenizer.tokenize(text)
print(tokens)  # Expected output: list of tokens

['John', 's', 'dog', 'Max', 'loves', 'chasing', 'after', 'tennis', 'balls', 'in', 'the', 'park', 'It', 's', 'his', 'favorite', 'activity']
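
The output above splits `John's` into `John` and `s` because `\w+` stops at the apostrophe. One possible adjustment is a pattern that allows an apostrophe inside a word. The sketch below uses `re.findall` from the standard library as a stand-in for the tokenizer, and the pattern itself is an assumption, not the course's intended solution.

```python
import re

text = "John's dog, Max, loves chasing after tennis balls in the park. It's his favorite activity!"

# Word characters, optionally followed by an apostrophe and more word characters
contraction_pattern = r"\w+(?:'\w+)?"
contraction_tokens = re.findall(contraction_pattern, text)
print(contraction_tokens)
```

With this pattern, `John's` and `It's` each come back as a single token, while commas, periods, and exclamation marks are still dropped.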


In [40]:
# Exercise 7: Custom Tokenizer
from nltk.tokenize import WhitespaceTokenizer

text = "Hello there! How are you today?"
# Implement another method of tokenization
# Example: WhitespaceTokenizer splits on whitespace only, so punctuation stays attached to words
whitespace_tokens = WhitespaceTokenizer().tokenize(text)
print(whitespace_tokens)  # Expected output: list of tokens based on whitespace

['Hello', 'there!', 'How', 'are', 'you', 'today?']


# Sklearn CountVectorizer
Finally, let's use sklearn's CountVectorizer to create a bag of words representation for a sample text.

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I wanted the pineapple from the competition.",
    "This pineapple was the ultimate prize.",
    "And the third team stole my pineapple dream.",
    "Did you see the first pineapple at the competition?"
]

vectorizer = CountVectorizer()
vectorizer.fit(corpus)

# Try the get_feature_names_out method to see the column names (the vocabulary)
print(vectorizer.get_feature_names_out())

# Try the toarray method to see the count matrix (one row per sentence, one column per word)
vectorizer.transform(corpus).toarray()

['and' 'at' 'competition' 'did' 'dream' 'first' 'from' 'my' 'pineapple'
 'prize' 'see' 'stole' 'team' 'the' 'third' 'this' 'ultimate' 'wanted'
 'was' 'you']


array([[0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0],
       [1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 1]])
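
To see where these counts come from, the matrix can be rebuilt by hand with the standard library. This is a rough sketch that assumes CountVectorizer's default behaviour of lowercasing the text and keeping only tokens of two or more word characters, which is why `I` has no column; the helper names are illustrative.

```python
import re
from collections import Counter

corpus = [
    "I wanted the pineapple from the competition.",
    "This pineapple was the ultimate prize.",
    "And the third team stole my pineapple dream.",
    "Did you see the first pineapple at the competition?"
]

# Assumption: mimic the default tokenization (lowercase, tokens of 2+ word characters)
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

# Vocabulary: every distinct token in the corpus, sorted alphabetically
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

def count_row(doc):
    # One count per vocabulary word; Counter returns 0 for absent words
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]

matrix = [count_row(doc) for doc in corpus]
print(vocabulary)
for row in matrix:
    print(row)
```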

These exercises should help improve your skills in string manipulation, regex, and tokenization using Python!

In [42]:
import pandas as pd


df = pd.DataFrame(vectorizer.transform(corpus).toarray(), columns=vectorizer.get_feature_names_out())
df


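|   | and | at | competition | did | dream | first | from | my | pineapple | prize | see | stole | team | the | third | this | ultimate | wanted | was | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |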
