
# Regex
. : any character except a newline.  
\ : escapes a character that has a special meaning in regex (e.g. \\. or \\+).  
[ ] : matches a single character from the set inside the brackets.  
( ) : grouping.  
\- : inside brackets, denotes a range (e.g. [a-z]).  
^, $ : start and end of the string (or of each line with re.MULTILINE), respectively.  
\+ : one or more matches.  
\* : zero or more matches.  
? : zero or one match.  
{m,n} : m to n repetitions, both inclusive.  
{m} : exactly m times.  
{m,} : m or more.
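
The list above can be checked with a few one-line experiments. This is a minimal sketch using only Python's standard `re` module; the sample strings and variable names are made up for illustration and do not come from the dataset.

```python
import re

numbers = "1 22 333 4444"

any_char = re.findall(r"a.c", "abc a-c ac")          # . needs exactly one character between a and c
escaped = re.findall(r"1\+1", "1+1 = 2, 11")         # \+ is a literal plus sign
char_set = re.findall(r"[aeiou]", "regex")           # one character from the set
char_range = re.findall(r"[a-c]+", "abcxyzcab")      # range a-c, one or more times
at_start = re.findall(r"^\w+", "start middle end")   # ^ anchors at the start
at_end = re.findall(r"\w+$", "start middle end")     # $ anchors at the end
zero_or_more = re.findall(r"ab*", "a ab abbb")       # * allows zero b's
optional = re.findall(r"colou?r", "color colour")    # ? makes the u optional
group = re.search(r"(ab)+", "ababab").group(0)       # ( ) lets + repeat the whole "ab"
m_to_n = re.findall(r"\d{2,3}", numbers)             # 2 to 3 digits
exactly_m = re.findall(r"\d{4}", numbers)            # exactly 4 digits
m_or_more = re.findall(r"\d{3,}", numbers)           # 3 or more digits

print(any_char)      # ['abc', 'a-c']
print(escaped)       # ['1+1']
print(char_set)      # ['e', 'e']
print(char_range)    # ['abc', 'cab']
print(at_start)      # ['start']
print(at_end)        # ['end']
print(zero_or_more)  # ['a', 'ab', 'abbb']
print(optional)      # ['color', 'colour']
print(group)         # ababab
print(m_to_n)        # ['22', '333', '444']
print(exactly_m)     # ['4444']
print(m_or_more)     # ['333', '4444']
```

Note that `{2,3}` is written without a space: in Python, `{2, 3}` is treated as literal text rather than a quantifier.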



In [1]:
import re
import pandas as pd

In [2]:
portuguese_questions = pd.read_csv('/content/stackoverflow_portugues.csv')
spanish_questions = pd.read_csv('/content/stackoverflow_espanhol.csv')
english_questions = pd.read_csv('/content/stackoverflow_ingles.csv')

In [3]:
english_questions.head(1)

Unnamed: 0,Id,Título,Questão,Tags,Pontuação,Visualizações
0,11227809,Why is it faster to process a sorted array tha...,<p>Here is a piece of C++ code that seems very...,<java><c++><performance><optimization><branch-...,23057,1358574


In [4]:
spanish_questions.head(1)

Unnamed: 0,Id,Título,Questão,Tags,Pontuação,Visualizações
0,18232,¿Cómo evitar la inyección SQL en PHP?,<p>Las sentencias dinámicas son sentencias SQL...,<php><mysql><sql><seguridad><inyección-sql>,169,38614


In [5]:
portuguese_questions.head(1)

Unnamed: 0,Id,Título,Questão,Tags,Pontuação,Visualizações
0,2402,Como fazer hash de senhas de forma segura?,"<p>Se eu fizer o <em><a href=""http://pt.wikipe...",<hash><segurança><senhas><criptografia>,350,22367


In [6]:
# re.findall returns a list of all non-overlapping matches of the pattern
# arguments: the regular expression, then the string to search
re.findall(r"<.*?>", english_questions['Questão'][0])

['<p>',
 '</p>',
 '<pre class="lang-cpp prettyprint-override">',
 '<code>',
 '</code>',
 '</pre>',
 '<ul>',
 '<li>',
 '<code>',
 '</code>',
 '</li>',
 '<li>',
 '</li>',
 '</ul>',
 '<p>',
 '</p>',
 '<pre class="lang-java prettyprint-override">',
 '<code>',
 '</code>',
 '</pre>',
 '<p>',
 '</p>',
 '<hr>',
 '<p>',
 '</p>',
 '<ul>',
 '<li>',
 '</li>',
 '<li>',
 '</li>',
 '<li>',
 '</li>',
 '</ul>']
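
The `?` after `.*` in the pattern above makes the quantifier lazy, which is why each tag is returned separately. A small sketch of the difference, using a made-up HTML snippet rather than the real question body:

```python
import re

snippet = "<p>Hello <code>world</code></p>"

# Greedy: .* consumes as much as possible, so the match runs
# from the first '<' to the last '>' in the string.
greedy = re.findall(r"<.*>", snippet)

# Lazy: .*? consumes as little as possible, so each match
# stops at the first '>' it reaches.
lazy = re.findall(r"<.*?>", snippet)

print(greedy)  # ['<p>Hello <code>world</code></p>']
print(lazy)    # ['<p>', '<code>', '</code>', '</p>']
```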

In [7]:
re.sub(r"<.*?>", "T---E---S---T", english_questions['Questão'][0])

'T---E---S---THere is a piece of C++ code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster.T---E---S---T\n\nT---E---S---TT---E---S---T#include &lt;algorithm&gt;\n#include &lt;ctime&gt;\n#include &lt;iostream&gt;\n\nint main()\n{\n    // Generate data\n    const unsigned arraySize = 32768;\n    int data[arraySize];\n\n    for (unsigned c = 0; c &lt; arraySize; ++c)\n        data[c] = std::rand() % 256;\n\n    // !!! With this, the next loop runs faster\n    std::sort(data, data + arraySize);\n\n    // Test\n    clock_t start = clock();\n    long long sum = 0;\n\n    for (unsigned i = 0; i &lt; 100000; ++i)\n    {\n        // Primary loop\n        for (unsigned c = 0; c &lt; arraySize; ++c)\n        {\n            if (data[c] &gt;= 128)\n                sum += data[c];\n        }\n    }\n\n    double elapsedTime = static_cast&lt;double&gt;(clock() - start) / CLOCKS_PER_SEC;\n\n    std::cout &lt;&lt; elapsedTime &lt;&l

In [8]:
regex = re.compile(r"<.*?>")
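
A compiled pattern exposes the same operations as the module-level functions, without repeating the pattern string on every call. The sketch below assumes a made-up `sample` string, and the name `tag_regex` is only illustrative:

```python
import re

sample = "<p>sorted <code>array</code></p>"
tag_regex = re.compile(r"<.*?>")

# The compiled object's methods mirror re.findall and re.sub.
found = tag_regex.findall(sample)
stripped = tag_regex.sub("", sample)

print(found)     # ['<p>', '<code>', '</code>', '</p>']
print(stripped)  # sorted array
```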

In [None]:
def remove_tags(texts, regex):
  # Replace every regex match with a space; accepts one string or an iterable of strings
  if isinstance(texts, str):
    return regex.sub(" ", texts)
  else:
    return [regex.sub(" ", text) for text in texts]
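
As a quick check of both branches, here is a self-contained sketch that redefines an equivalent helper and calls it with a single string and with a list. The inputs are invented for illustration. Since the list branch simply iterates, a column such as `english_questions['Questão']` should work the same way.

```python
import re

tag_regex = re.compile(r"<.*?>")

def remove_tags(texts, regex):
    # A single string is cleaned directly; anything else is
    # treated as an iterable of strings and cleaned item by item.
    if isinstance(texts, str):
        return regex.sub(" ", texts)
    return [regex.sub(" ", text) for text in texts]

single = remove_tags("<p>Hello</p>", tag_regex)
many = remove_tags(["<p>one</p>", "<li>two</li>"], tag_regex)

print(repr(single))  # ' Hello '
print(many)          # [' one ', ' two ']
```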

In [10]:
questions = remove_tags(english_questions['Questão'][0], regex)
print(questions)

 Here is a piece of C++ code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster. 

  #include &lt;algorithm&gt;
#include &lt;ctime&gt;
#include &lt;iostream&gt;

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c &lt; arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i &lt; 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c &lt; arraySize; ++c)
        {
            if (data[c] &gt;= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast&lt;double&gt;(clock() - start) / CLOCKS_PER_SEC;

    std::cout &lt;&lt; elapsedTime &lt;&lt; std::endl;
    std::cout &lt;&lt; "sum = " &lt;&lt; sum &lt;&lt; std::endl;
}
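
The output above still contains HTML entities such as `&lt;` and `&gt;`. One possible follow-up step, not part of the original notebook, is to decode them with the standard library's `html.unescape` after the tags are removed. The `raw` string below is a shortened, made-up stand-in for the question body.

```python
import html
import re

tag_regex = re.compile(r"<.*?>")
raw = ("<pre><code>#include &lt;algorithm&gt;\n"
       "for (unsigned c = 0; c &lt; arraySize; ++c)</code></pre>")

# Remove tags first, then decode entities.
no_tags = tag_regex.sub(" ", raw)
decoded = html.unescape(no_tags)

# Decoding first would turn &lt;algorithm&gt; into <algorithm>,
# which the tag pattern would then delete along with real tags.
wrong_order = tag_regex.sub(" ", html.unescape(raw))

print(decoded)
print(wrong_order)
```

The order matters here: decoding before stripping makes escaped source code look like markup, so part of it is lost.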
  