# **Handling Strings**

## **RegEx Cheat Sheet**

<!DOCTYPE html>
<html>
<head>
<style>
table {
  width: 100%;
  border-collapse: collapse;
}
th, td {
  border: 1px solid #ddd;
  padding: 8px;
  text-align: left;
}
th {
  background-color: #f2f2f2;
}
</style>
</head>
<body>

<table>
  <thead>
    <tr>
      <th>Character/Pattern</th>
      <th>Description</th>
      <th>Example</th>
      <th>Matches in "SKU-987: Mouse (Used)"</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>^</code></td>
      <td><strong>Start of a string.</strong></td>
      <td><code>^SKU</code></td>
      <td><code>SKU</code></td>
    </tr>
    <tr>
      <td><code>$</code></td>
      <td><strong>End of a string.</strong></td>
      <td><code>Used\)$</code></td>
      <td><code>Used)</code></td>
    </tr>
    <tr>
      <td><code>.</code></td>
      <td><strong>Any single character.</strong></td>
      <td><code>u.e</code></td>
      <td><code>use</code> in <code>Mouse</code></td>
    </tr>
    <tr>
      <td><code>\d</code></td>
      <td><strong>Any digit</strong> (0-9).</td>
      <td><code>\d+</code></td>
      <td><code>987</code></td>
    </tr>
    <tr>
      <td><code>\w</code></td>
      <td><strong>Any "word" character</strong> (a-z, A-Z, 0-9, _).</td>
      <td><code>\w\w\w</code></td>
      <td><code>SKU</code>, <code>987</code>, <code>Mou</code>, <code>Use</code></td>
    </tr>
    <tr>
      <td><code>\s</code></td>
      <td><strong>Any whitespace character</strong> (space, tab, newline).</td>
      <td><code>Mouse\s</code></td>
      <td><code>Mouse </code></td>
    </tr>
    <tr>
      <td><code>[...]</code></td>
      <td><strong>A set of characters.</strong> Matches any one character inside the brackets.</td>
      <td><code>[sS]KU</code></td>
      <td><code>SKU</code></td>
    </tr>
    <tr>
      <td><code>[^...]</code></td>
      <td><strong>Negated set.</strong> Matches any single character <em>not</em> in the set.</td>
      <td><code>[^:]</code></td>
      <td>All characters <em>except</em> the colon <code>:</code></td>
    </tr>
    <tr>
      <td><code>|</code></td>
      <td><strong>Logical OR.</strong> Matches the pattern before or after the pipe.</td>
      <td><code>Mouse|Keyboard</code></td>
      <td><code>Mouse</code></td>
    </tr>
    <tr>
      <td><code>(...)</code></td>
      <td><strong>Grouping.</strong> Groups a sub-pattern and captures the text it matches.</td>
      <td><code>(\w+)-(\d+)</code></td>
      <td>Captures <code>SKU</code> and <code>987</code> separately</td>
    </tr>
    <tr>
      <td><code>{n}</code></td>
      <td><strong>Quantifier: exactly n times.</strong></td>
      <td><code>\d{3}</code></td>
      <td><code>987</code></td>
    </tr>
    <tr>
      <td><code>{n,}</code></td>
      <td><strong>Quantifier: at least n times.</strong></td>
      <td><code>\d{2,}</code></td>
      <td><code>987</code></td>
    </tr>
    <tr>
      <td><code>*</code></td>
      <td><strong>Quantifier: zero or more</strong> of the preceding element.</td>
      <td><code>ca*t</code> in <code>caat</code></td>
      <td><code>caat</code></td>
    </tr>
    <tr>
      <td><code>+</code></td>
      <td><strong>Quantifier: one or more</strong> of the preceding element.</td>
      <td><code>o+</code> in <code>booook</code></td>
      <td><code>oooo</code></td>
    </tr>
    <tr>
      <td><code>?</code></td>
      <td><strong>Quantifier: zero or one</strong> of the preceding element (optional).</td>
      <td><code>SKU-?\d+</code></td>
      <td><code>SKU-987</code></td>
    </tr>
  </tbody>
</table>

</body>
</html>
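
The patterns in the cheat sheet can be checked directly with Python's built-in `re` module. The following is a minimal sketch that runs a few of them against the sample string; the variable names are illustrative only. Note that a literal parenthesis has to be escaped (`\)`), because an unescaped one opens or closes a group.

```python
import re

text = "SKU-987: Mouse (Used)"

# Anchors: ^ ties the match to the start, $ to the end of the string
start = re.search(r"^SKU", text).group()
end = re.search(r"Used\)$", text).group()   # \) is a literal parenthesis

# Character classes and quantifiers
digits = re.findall(r"\d+", text)           # runs of one or more digits
triples = re.findall(r"\w\w\w", text)       # non-overlapping 3-character word runs
not_colon = re.findall(r"[^:]", text)       # every character except ':'

# Alternation and capturing groups
product = re.search(r"Mouse|Keyboard", text).group()
prefix, number = re.search(r"(\w+)-(\d+)", text).groups()

print(start, end, digits, triples, product, prefix, number)
```

`re.search` returns the first match (or `None`), while `re.findall` returns every non-overlapping match, which is why `\w\w\w` reports several results.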

In [53]:
import pandas as pd
import numpy as np

In [8]:
data = {
    "fruits": ["apple", "orange   ", "   banada", "pineapple", "grape"]
}
fruits = pd.DataFrame(data)
print(fruits)

      fruits
0      apple
1  orange   
2     banada
3  pineapple
4      grape


### **String Methods**

In [9]:
# strip leading and trailing whitespace
fruits["fruits"].str.strip()

0        apple
1       orange
2       banada
3    pineapple
4        grape
Name: fruits, dtype: object

In [10]:
# upper case
fruits["fruits"].str.upper()

0        APPLE
1    ORANGE   
2       BANADA
3    PINEAPPLE
4        GRAPE
Name: fruits, dtype: object

In [11]:
# lower case
fruits["fruits"].str.lower()

0        apple
1    orange   
2       banada
3    pineapple
4        grape
Name: fruits, dtype: object

In [12]:
# title case
fruits["fruits"].str.title()

0        Apple
1    Orange   
2       Banada
3    Pineapple
4        Grape
Name: fruits, dtype: object

In [15]:
fruits_cleaned = fruits["fruits"].str.strip().str.title()
print(fruits_cleaned)

0        Apple
1       Orange
2       Banada
3    Pineapple
4        Grape
Name: fruits, dtype: object


In [18]:
# left - first 2 characters
fruits["fruits"].str[:2]

0    ap
1    or
2      
3    pi
4    gr
Name: fruits, dtype: object

In [20]:
# right - 3 characters from right
fruits["fruits"].str[-3:]

0    ple
1       
2    ada
3    ple
4    ape
Name: fruits, dtype: object

In [22]:
# mid - characters at positions 2 to 4 (the end index 5 is excluded)
fruits["fruits"].str[2:5]

0    ple
1    ang
2     ba
3    nea
4    ape
Name: fruits, dtype: object
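
The slices above run on the raw column, so the padded rows (`"orange   "` and `"   banada"`) return whitespace instead of letters. A minimal sketch of the same three slices applied after `str.strip()` follows; the variable names `clean`, `left`, `right`, and `mid` are illustrative and not part of the original notebook.

```python
import pandas as pd

fruits = pd.DataFrame(
    {"fruits": ["apple", "orange   ", "   banada", "pineapple", "grape"]}
)

# Strip first so that positions count from the first and last real character
clean = fruits["fruits"].str.strip()

left = clean.str[:2]     # first 2 characters
right = clean.str[-3:]   # last 3 characters
mid = clean.str[2:5]     # positions 2, 3 and 4

print(pd.DataFrame({"left": left, "right": right, "mid": mid}))
```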

### **Exercise**

Given the following data, create a DataFrame with the columns `sku`, `product_name`, and `status`.

In [16]:
data = {
    'product_info': [
        'SKU-10025: Laptop (New)', 
        'SKU-987: Mouse (Used-fair)', 
        'SKU-4001: Keyboard-wireless', 
        'SKU-2055: "Monitor" (Refurbished)',
        'SKU-99: "Webcam" - new',
        'SKU-10025: Headphones new' # Duplicate SKU
    ]
}
df = pd.DataFrame(data)
df

|   | product_info |
|---|---|
| 0 | SKU-10025: Laptop (New) |
| 1 | SKU-987: Mouse (Used-fair) |
| 2 | SKU-4001: Keyboard-wireless |
| 3 | SKU-2055: "Monitor" (Refurbished) |
| 4 | SKU-99: "Webcam" - new |
| 5 | SKU-10025: Headphones new |


In [32]:
# sku: the text between "SKU-" and the colon
df["sku"] = df["product_info"].str.split(":").str[0].str.split("-").str[1]

In [39]:
# product_name: first space-separated token after the colon, with double quotes removed
df["product_name"] = df["product_info"].str.split(":").str[1].str.split(" ").str[1].str.replace('"', '')

In [58]:
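# status: last space-separated token, with the parentheses removed (literal, not regex)
df["status"] = df["product_info"].str.split(" ").str[-1]
df["status"] = df["status"].str.replace("(", "", regex=False).str.replace(")", "", regex=False)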

In [62]:
# the Keyboard row has no status, so its last token is the product name itself
df["status"] = df["status"].str.replace("Keyboard-wireless", "Not available", regex=False)

In [63]:
df

|   | product_info | sku | product_name | status |
|---|---|---|---|---|
| 0 | SKU-10025: Laptop (New) | 10025 | Laptop | New |
| 1 | SKU-987: Mouse (Used-fair) | 987 | Mouse | Used-fair |
| 2 | SKU-4001: Keyboard-wireless | 4001 | Keyboard-wireless | Not available |
| 3 | SKU-2055: "Monitor" (Refurbished) | 2055 | Monitor | Refurbished |
| 4 | SKU-99: "Webcam" - new | 99 | Webcam | new |
| 5 | SKU-10025: Headphones new | 10025 | Headphones | new |
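
The same three columns can also be pulled out in one pass with `Series.str.extract` and a regular expression built from the cheat-sheet elements. This is only a sketch of an alternative to the `split`-based solution, not the notebook's own approach: the pattern and the name `result` are assumptions written for these six rows, and other input formats would likely need a different pattern.

```python
import pandas as pd

df = pd.DataFrame({
    "product_info": [
        'SKU-10025: Laptop (New)',
        'SKU-987: Mouse (Used-fair)',
        'SKU-4001: Keyboard-wireless',
        'SKU-2055: "Monitor" (Refurbished)',
        'SKU-99: "Webcam" - new',
        'SKU-10025: Headphones new',
    ]
})

# Adjacent raw strings are concatenated into a single pattern.
pattern = (
    r'^SKU-(?P<sku>\d+):\s*'                 # digits after "SKU-"
    r'"?(?P<product_name>[^"\s]+)"?'         # first token, quotes optional
    r'(?:\s+-)?'                             # optional " -" separator
    r'(?:\s+\(?(?P<status>[^\s()]+)\)?)?$'   # optional last token, parentheses optional
)

# Named groups become the column names of the returned DataFrame
result = df["product_info"].str.extract(pattern)

# Rows without a status token come back as NaN
result["status"] = result["status"].fillna("Not available")
print(result)
```

Named groups (`(?P<name>...)`) keep the column naming next to the pattern, and the non-capturing groups (`(?:...)`) make parts optional without adding extra columns.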
