### **1. Removing HTML Tags and Special Characters**

In [1]:
import re

def remove_html_tags(text):
    clean_text = re.sub(r'<.*?>', '', text)
    return clean_text

def remove_special_characters(text):
    clean_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return clean_text
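
A quick usage sketch of the two helpers above; the sample string is made up for illustration. Note that stripping tags leaves HTML entities such as `&amp;` behind, and the special-character filter also deletes punctuation and any non-ASCII letters.

```python
import re

def remove_html_tags(text):
    # Non-greedy match: drop anything that looks like <...>
    return re.sub(r'<.*?>', '', text)

def remove_special_characters(text):
    # Keep only ASCII letters, digits, and whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

sample = "<p>Hello, <b>world</b>! Price: $5 &amp; up</p>"  # hypothetical input
no_tags = remove_html_tags(sample)
print(no_tags)                             # Hello, world! Price: $5 &amp; up
print(remove_special_characters(no_tags))  # Hello world Price 5 amp up
```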

### **2. Tokenization**

Tokenization breaks down text into its constituent parts and facilitates the counting and analysis of words.

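**Be careful: ```word_tokenize(text)``` ≠ ```text.split()```.** ```word_tokenize``` separates punctuation into its own tokens, whereas ```split()``` only cuts on whitespace.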

In [12]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

text = "I am a good good boy"

def tokenize_text(text):
    tokens = word_tokenize(text)
    return tokens

print(tokenize_text(text))

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\zekin\AppData\Roaming\nltk_data...


['I', 'am', 'a', 'good', 'good', 'boy']


[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


### **3. Lowercasing & Uppercasing**

In [15]:
def convert_to_lowercase(text):
    lowercased_text = text.lower()
    return lowercased_text

def convert_to_uppercase(text):
    uppercased_text = text.upper()
    return uppercased_text

### **4. Stopword Removal**

Stopwords are common words such as “the,” “and,” or “in” that carry little meaningful information in many NLP tasks. Removing stopwords can reduce noise and improve the efficiency of text analysis.

**```remove_stopwords``` expects a list of tokens. If you have raw text (a string), tokenize it first; otherwise the function iterates over individual characters.**

In [20]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return filtered_tokens

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\zekin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [26]:
tokenized_text = tokenize_text("I am a good good boy and the cool guy or you can call me zzippy")

In [27]:
filtered_tokens = remove_stopwords(tokenized_text)
filtered_tokens

['I', 'good', 'good', 'boy', 'cool', 'guy', 'call', 'zzippy']
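
In the output above, `'I'` survives because the NLTK English stopword list is lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive variant follows. The small `STOP_WORDS` set is a hand-written stand-in for `stopwords.words('english')` so the example runs without downloading anything, and `remove_stopwords_ci` is a name introduced here for illustration.

```python
# Hand-picked subset for illustration only (an assumption, not NLTK's full list)
STOP_WORDS = {"i", "am", "a", "and", "the", "or", "you", "can", "me"}

def remove_stopwords_ci(tokens, stop_words=STOP_WORDS):
    # Compare the lowercased token, but keep the original casing in the output
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["I", "am", "a", "good", "boy", "and", "the", "cool", "guy"]
print(remove_stopwords_ci(tokens))  # ['good', 'boy', 'cool', 'guy']
```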

### **5. Stemming and Lemmatization**

Stemming and lemmatization are techniques to reduce words to their root forms, which can help group similar words. Stemming is more aggressive and may result in non-dictionary words, whereas lemmatization produces valid words.

In [30]:
import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

def stem_text(tokens):
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    return stemmed_tokens

def lemmatize_text(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return lemmatized_tokens

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\zekin\AppData\Roaming\nltk_data...


In [41]:
token = word_tokenize("I am running and I used to love swimming, I love you and you loved me!")

In [42]:
stem_text(token)

['i',
 'am',
 'run',
 'and',
 'i',
 'use',
 'to',
 'love',
 'swim',
 ',',
 'i',
 'love',
 'you',
 'and',
 'you',
 'love',
 'me',
 '!']

In [40]:
lemmatize_text(token)

['I',
 'am',
 'running',
 'and',
 'I',
 'used',
 'to',
 'love',
 'swimming',
 ',',
 'I',
 'love',
 'you',
 'and',
 'you',
 'love',
 'me',
 '!']

### **6. Removing Duplicate Text**

In [43]:
def remove_duplicates(token):
    unique_texts = list(set(token))
    return unique_texts

In [44]:
text_2 = remove_duplicates(token)
text_2

['!',
 'you',
 'am',
 'love',
 'running',
 ',',
 'swimming',
 'used',
 'to',
 'loved',
 'and',
 'me',
 'I']
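
A `set` does not preserve order, which is why the output above is shuffled. If the original token order matters, one option is `dict.fromkeys`, which keeps the first occurrence of each item (dictionaries preserve insertion order in Python 3.7+). `remove_duplicates_ordered` is a name introduced here for illustration.

```python
def remove_duplicates_ordered(tokens):
    # dict keys are unique and keep insertion order, so first occurrences win
    return list(dict.fromkeys(tokens))

tokens = ["I", "love", "you", "and", "you", "love", "me"]
print(remove_duplicates_ordered(tokens))  # ['I', 'love', 'you', 'and', 'me']
```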

### **7. Token to String**

In [55]:
tokens = ["I", "amm", "a", "gud", "boy"]
strings = " ".join(tokens)
strings

'I amm a gud boy'

### **8. Dealing with Noisy Text**

Noisy text data can include typos, abbreviations, non-standard language usage, and other irregularities. Addressing such noise is crucial for ensuring the accuracy of text analysis. Techniques like spell-checking, correction, and custom rules for specific noise patterns can be applied.

In [52]:
from spellchecker import SpellChecker

def correct_spelling(tokens):
    spell = SpellChecker()
    # correction() may return None when no candidate is found; fall back to the original word
    corrected_tokens = [spell.correction(word) or word for word in tokens]
    corrected_text = ' '.join(corrected_tokens)
    return corrected_text

In [54]:
tokens = ["I", "amm", "a", "gud", "boy"]
type(correct_spelling(tokens))  # correct_spelling returns one joined string, not a list of tokens

str

### **9. Handling Encoding Issues**

Encoding problems can lead to unreadable characters or errors during text processing. Ensuring that text is correctly encoded (e.g., UTF-8) is crucial to prevent issues related to character encoding.

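`text.encode('utf-8').decode('utf-8')` converts a string to UTF-8 bytes and then back to a string. For any string that can be encoded, the round trip returns the input unchanged. The only step that can fail is `encode`, which raises `UnicodeEncodeError` when the string contains code points UTF-8 cannot represent, such as lone surrogates. The `decode` step cannot fail here, because the bytes were just produced by a successful encode. Decoding errors arise when reading raw bytes with the wrong codec, not in this round trip.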

In [46]:
def fix_encoding(text):
    try:
        decoded_text = text.encode('utf-8').decode('utf-8')
    except UnicodeError:  # base class of UnicodeEncodeError and UnicodeDecodeError
        decoded_text = 'Encoding Error'
    return decoded_text

In [50]:
text = "I am running and I used to love swimming, I love you and you loved me!"
fix_encoding(text)

'I am running and I used to love swimming, I love you and you loved me!'
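
A small sketch of where encoding errors actually show up. A Python `str` fails to encode to UTF-8 only in unusual cases such as lone surrogates; the more common problem is decoding raw bytes with the wrong codec. The sample values are made up for illustration.

```python
# 1) A lone surrogate cannot be encoded as UTF-8
bad_text = "abc\ud800"
try:
    bad_text.encode("utf-8")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)  # True

# 2) Bytes written as Latin-1 are not necessarily valid UTF-8
raw = "café".encode("latin-1")  # b'caf\xe9'
try:
    raw.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)  # True

# errors="replace" substitutes U+FFFD for the invalid byte instead of raising
repaired = raw.decode("utf-8", errors="replace")
print(repaired == "caf\ufffd")  # True
```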

### **10. Whitespace Removal**

In [56]:
def remove_whitespace(text):
    cleaned_text = ' '.join(text.split())
    return cleaned_text

In [57]:
text = "I am running and I used to love swimming, I love you and you loved me!"
text.split()

['I',
 'am',
 'running',
 'and',
 'I',
 'used',
 'to',
 'love',
 'swimming,',
 'I',
 'love',
 'you',
 'and',
 'you',
 'loved',
 'me!']
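
The cell above only shows `text.split()`. Here is a short sketch of `remove_whitespace` itself on a messy, made-up input. Calling `split()` with no arguments cuts on any run of spaces, tabs, or newlines and drops leading and trailing whitespace, so joining with a single space normalizes everything.

```python
def remove_whitespace(text):
    # split() with no separator collapses any run of whitespace
    return ' '.join(text.split())

messy = "  I   am \t running\n and  jumping  "
print(remove_whitespace(messy))  # I am running and jumping
```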

### **11. Handling Text Language Identification**

In [59]:
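from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def detect_language(text):
    try:
        language = detect(text)
    except LangDetectException:
        # Raised when the text has no usable features (e.g. empty or digits only)
        language = 'unknown'
    return language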

In [60]:
text = "I am running and I used to love swimming, I love you and you loved me!"
detect_language(text)

'en'

In [62]:
text = "我喜欢你"
detect_language(text)

'ko'

**Note:** ```'ko'``` is a misdetection, since the input is Chinese. ```langdetect``` is unreliable on very short texts, and its results can vary between runs unless ```DetectorFactory.seed``` is set.

### **12. Handling Text Length Variation**

Text data often varies in length, and extreme variations can affect the performance of text analysis algorithms. Depending on your analysis goals, you may need to normalize text length. Techniques include:

- **Padding**: Adding tokens to shorter text samples to make them equal in length to longer samples. This is commonly used in tasks like text classification that require fixed input lengths.
- **Text Summarization**: Reducing the length of longer texts by generating concise summaries, which can be useful for information retrieval or summarization tasks.


In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def pad_text_sequences(text_sequences, max_length):
    padded_sequences = pad_sequences(text_sequences, maxlen=max_length, padding='post', truncating='post')
    return padded_sequences

In [None]:
# Note: pad_sequences expects lists of integer token IDs, so these raw strings must be encoded first.
texts = ["I love eating ice cream", "I am cool", "I am running and I used to love swimming, I love you and you loved me!"]
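
`pad_sequences` operates on lists of integer token IDs, so text has to be encoded before padding. The following is a library-free sketch of the same idea (post-padding and post-truncation with 0 as the pad value). `build_vocab`, `encode`, and `pad_post` are hypothetical helpers written for this example, not Keras functions.

```python
def build_vocab(texts):
    # Assign each new word the next free ID; 0 is reserved for padding
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(text, vocab):
    # Map each word to its integer ID
    return [vocab[word] for word in text.lower().split()]

def pad_post(sequences, max_length, pad_value=0):
    # Truncate at the end, then append pad_value until max_length is reached
    padded = []
    for seq in sequences:
        trimmed = seq[:max_length]
        padded.append(trimmed + [pad_value] * (max_length - len(trimmed)))
    return padded

texts = ["I love eating ice cream", "I am cool"]
vocab = build_vocab(texts)
sequences = [encode(t, vocab) for t in texts]
print(pad_post(sequences, max_length=4))
# [[1, 2, 3, 4], [1, 6, 7, 0]]
```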

### **13. Handling Biases and Fairness**

In text data, biases related to gender, race, or other sensitive attributes can be present. Addressing these biases is crucial for ensuring fairness in NLP applications. Techniques include debiasing word embeddings and using reweighted loss functions to account for bias.

In [None]:
def debias_word_embeddings(embeddings, gender_specific_words):
    # Implement a debiasing technique to reduce gender bias in word embeddings
    pass
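
The stub above is left unimplemented. As one possible direction, here is a minimal sketch of the "neutralize" step used in hard-debiasing approaches: estimate a gender direction from a defining word pair and remove that component from every word that is not inherently gendered. The `definitional_pair` parameter, the dict-of-lists embedding format, and the toy 3-dimensional vectors are all assumptions made for illustration; real pipelines typically use many pairs and a learned subspace.

```python
def debias_word_embeddings(embeddings, gender_specific_words,
                           definitional_pair=("he", "she")):
    # embeddings: dict mapping word -> list of floats (assumed format)
    a, b = (embeddings[w] for w in definitional_pair)
    direction = [x - y for x, y in zip(a, b)]
    norm = sum(d * d for d in direction) ** 0.5
    direction = [d / norm for d in direction]  # unit-length gender direction

    debiased = {}
    for word, vec in embeddings.items():
        if word in gender_specific_words:
            debiased[word] = list(vec)  # leave inherently gendered words untouched
            continue
        # Subtract the projection of vec onto the gender direction
        proj = sum(v * d for v, d in zip(vec, direction))
        debiased[word] = [v - proj * d for v, d in zip(vec, direction)]
    return debiased

toy = {
    "he": [1.0, 0.0, 0.0],
    "she": [-1.0, 0.0, 0.0],
    "doctor": [0.5, 1.0, 0.0],
}
result = debias_word_embeddings(toy, gender_specific_words={"he", "she"})
print(result["doctor"])  # [0.0, 1.0, 0.0]
```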

### **14. Handling Large Text Corpora**

When dealing with large text corpora, memory and processing time become critical. Data streaming, batch processing, and parallelization can be applied to clean and process large volumes of text data efficiently.

```num_workers``` sets the number of worker processes used for parallel execution.

In [1]:
from multiprocessing import Pool

def parallel_process_text(data, cleaning_function, num_workers):
    with Pool(num_workers) as pool:
        cleaned_data = pool.map(cleaning_function, data)
    return cleaned_data

```strip()```: removes whitespace from both ends of a string.

In [65]:
def clean_text(text):
    return text.lower().strip()

texts = ["Hello World ", "Python ROCKS", " Multiprocessing "]
result = parallel_process_text(texts, clean_text, 2)
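
For the streaming and batch-processing side mentioned above, a generator can feed fixed-size batches to the cleaning function so that the whole corpus never has to sit in memory at once. `iter_batches` and `clean_in_batches` are hypothetical helper names introduced for this sketch.

```python
from itertools import islice

def iter_batches(iterable, batch_size):
    # Yield lists of up to batch_size items without materializing the input
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def clean_text(text):
    return text.lower().strip()

def clean_in_batches(texts, cleaning_function, batch_size):
    # Lazily clean one batch at a time
    for batch in iter_batches(texts, batch_size):
        yield [cleaning_function(t) for t in batch]

texts = ["Hello World ", "Python ROCKS", " Multiprocessing "]
for cleaned in clean_in_batches(texts, clean_text, batch_size=2):
    print(cleaned)
# ['hello world', 'python rocks']
# ['multiprocessing']
```

Each batch could also be handed to `pool.map` instead of the list comprehension. On platforms that start worker processes with the spawn method (such as Windows), `Pool` code generally needs to run under an `if __name__ == "__main__":` guard.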

### **15. Handling Multilingual Text Data**

Text data can be multilingual, which adds a layer of complexity. Applying language-specific cleaning and preprocessing techniques is important when dealing with multilingual text. Libraries like spaCy and NLTK support multiple languages and can be used to tokenize, lemmatize, and clean text in various languages.

In [5]:
import spacy

def clean_multilingual_text(text, language_code):
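    # Load the spaCy model for the given language (a model name such as "en_core_web_sm")
    nlp = spacy.load(language_code)
    # Process the text into a Doc object
    doc = nlp(text)
    # Take each token's lemma and join the lemmas with spaces
    cleaned_text = ' '.join([token.lemma_ for token in doc])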
    return cleaned_text

In [7]:
text = "I am running and jumping"
result = clean_multilingual_text(text, "en_core_web_sm")
result

'I be run and jump'

### **16. Handling Text Data with Long Documents**

Long documents, such as research papers or legal documents, can pose challenges in text analysis due to their length. Techniques like text summarization or document chunking can extract key information or break long documents into manageable sections for analysis.

In [12]:
from transformers import pipeline

# Load the summarization model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


In [13]:
# Input text
text = "Artificial intelligence has rapidly advanced, transforming industries like healthcare, finance, education, and transportation. Deep learning breakthroughs in image recognition and NLP have enabled self-driving cars and smart assistants. However, AI raises ethical and privacy issues, such as data misuse and bias. Experts predict the AI market will exceed $1 trillion by 2030, outpacing regulatory efforts. Automation boosts efficiency but causes job losses, while AI aids drug development and climate modeling. Public opinion is divided, seeing AI as both progress and risk. Its future holds potential and challenges, requiring a balance of innovation and responsibility."
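# Generate the summary
# max_length=130: upper bound on the summary length, in tokens.
# min_length=30: lower bound on the summary length, in tokens.
# do_sample=False: disable sampling and use deterministic decoding (greedy or beam search) for more stable results.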
summary = summarizer(text, max_length=130, min_length=30, do_sample=False)
print(summary[0]['summary_text'])

Your max_length is set to 130, but your input_length is only 120. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=60)


Experts predict the AI market will exceed $1 trillion by 2030. Public opinion is divided, seeing AI as both progress and risk. Its future holds potential and challenges, requiring a balance of innovation and responsibility.
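
Summarization models accept a limited input length, so long documents are often split into chunks first. Below is a minimal sketch of word-based chunking with overlap. `chunk_words` and its parameters are hypothetical, and counting words is only a rough proxy for the model's token count.

```python
def chunk_words(text, chunk_size, overlap=0):
    # Split text into chunks of chunk_size words; consecutive chunks share `overlap` words
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

doc = "one two three four five six seven"
print(chunk_words(doc, chunk_size=3, overlap=1))
# ['one two three', 'three four five', 'five six seven']
```

Each chunk could then be passed to `summarizer` separately and the partial summaries joined afterwards.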


### **17. Handling Text Data with Time References**

Text data that includes time references, such as dates or timestamps, may require special handling. You can extract and standardize time-related information, convert it to a standard format, or use it to create time series data for temporal analysis.

In [10]:
def extract_dates_and_times(text):
    # Implement date and time extraction logic (e.g., using regular expressions)
    pass
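
The stub above can be filled in many ways. A minimal regex-based sketch follows. It assumes only ISO-style dates (`YYYY-MM-DD`) and 24-hour times (`HH:MM`), which is a deliberate simplification, and it validates dates through `datetime` so that impossible dates are dropped. The return format (a dict with `dates` and `times` keys) is also an assumption.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
TIME_RE = re.compile(r"\b([01]\d|2[0-3]):([0-5]\d)\b")

def extract_dates_and_times(text):
    dates = []
    for match in DATE_RE.finditer(text):
        try:
            # strptime rejects impossible dates such as 2024-02-30
            parsed = datetime.strptime(match.group(0), "%Y-%m-%d")
        except ValueError:
            continue
        dates.append(parsed.date().isoformat())
    times = [m.group(0) for m in TIME_RE.finditer(text)]
    return {"dates": dates, "times": times}

sample = "Meeting on 2024-03-15 at 14:30, rescheduled from 2024-02-30 at 09:00."
print(extract_dates_and_times(sample))
# {'dates': ['2024-03-15'], 'times': ['14:30', '09:00']}
```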