## Extracting data from Wikipedia through an API

[Documentation](https://www.mediawiki.org/wiki/API:Main_page)

#### _[This page is intended for technical contributors and software developers...]_ 

----

In [1]:
import requests
import pandas as pd

### Search by title

In [2]:
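# Ask the Action API for the page titled "Xoloitzcuintle" and request its extract
url = 'https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Xoloitzcuintle&prop=extracts'

request_wiki = requests.get(url)

# A status code of 200 means the request succeeded
request_wiki.status_code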

200
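
The call above relies on the `requests` defaults. A slightly more defensive variant is sketched below: it sends a descriptive `User-Agent` (Wikimedia's User-Agent policy asks API clients to identify themselves), sets a timeout, and raises on HTTP errors. The `USER_AGENT` string, the contact address, and the helper names `make_session` and `fetch_json` are illustrative assumptions, not part of the original notebook.

```python
import requests

ENDPOINT = 'https://en.wikipedia.org/w/api.php'
# Hypothetical identifier: replace with your own tool name and contact details.
USER_AGENT = 'wiki-notebook-example/0.1 (contact: you@example.org)'


def make_session() -> requests.Session:
    """Return a session that sends the same User-Agent on every request."""
    session = requests.Session()
    session.headers.update({'User-Agent': USER_AGENT})
    return session


def fetch_json(session: requests.Session, endpoint: str, params: dict) -> dict:
    """GET the endpoint and return the decoded JSON, raising on HTTP errors."""
    response = session.get(endpoint, params=params, timeout=10)
    response.raise_for_status()  # turns 4xx/5xx responses into exceptions
    return response.json()
```

With these helpers, a call such as `fetch_json(make_session(), ENDPOINT, {'action': 'query', 'format': 'json', 'titles': 'Xoloitzcuintle', 'prop': 'extracts'})` could stand in for `requests.get(url)` followed by `.json()`.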

In [3]:
request_wiki.json().keys()



In [4]:
#request_wiki.json()['query']

In [5]:
request_wiki.json()['query'].keys()

dict_keys(['pages'])

In [6]:
request_wiki.json()['query']['pages'].keys()

dict_keys(['243549'])

In [7]:
request_wiki.json()['query']['pages']['243549'].keys()

dict_keys(['pageid', 'ns', 'title', 'extract'])
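
The cells above index the response with the hard-coded page ID `'243549'`, which only works for this one article. A minimal sketch of a more general lookup follows; the helper name `first_page` and the shortened sample response are assumptions made for illustration. It also relies on the Action API's convention of marking nonexistent titles with a `missing` key.

```python
def first_page(response_json: dict) -> dict:
    """Return the single page record from an Action API `query` response."""
    pages = response_json['query']['pages']
    # `pages` is keyed by page ID, so take the first (and only) value.
    page = next(iter(pages.values()))
    if 'missing' in page:
        raise LookupError(f"Page not found: {page.get('title')!r}")
    return page


# Shortened stand-in for request_wiki.json(); the extract is a placeholder.
sample = {
    'query': {
        'pages': {
            '243549': {
                'pageid': 243549,
                'ns': 0,
                'title': 'Xoloitzcuintle',
                'extract': '<p>placeholder extract</p>',
            }
        }
    }
}

page = first_page(sample)
print(page['pageid'], page['title'])
```

In the notebook, `first_page(request_wiki.json())` would return the same record as `request_wiki.json()['query']['pages']['243549']`. Adding `formatversion=2` to the request is another option: the API then returns `pages` as a list, so no ID lookup is needed.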

In [8]:
#request_wiki.json()['query']['pages']['243549']

In [9]:
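# Flatten from 'query': columns are named like 'pages.243549.extract'
df = pd.json_normalize(request_wiki.json()['query'])

In [10]:
# Flatten from 'pages': columns are named like '243549.extract'
df = pd.json_normalize(request_wiki.json()['query']['pages'])

In [11]:
# Flatten the page record itself: one row with plain column names
df = pd.json_normalize(request_wiki.json()['query']['pages']['243549'])
df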

|  | pageid | ns | title | extract |
|---|---|---|---|---|
| 0 | 243549 | 0 | Xoloitzcuintle | `<p>The <b>Xoloitzcuintle</b> (or <b>Xoloitzqui...` |


In [12]:
#df['extract'][0]

In [13]:
#df = pd.json_normalize(request_wiki.json()['query'])
#df['pages.243549.extract'][0]

In [14]:
buscar_titulo = 'Xoloitzcuintle'
 
#endpoint = 'https://es.wikipedia.org/w/api.php'
endpoint = 'https://en.wikipedia.org/w/api.php'

params = {
            'action' : 'query',
            'format' : 'json',
            'titles' : buscar_titulo, 
            'prop' : 'extracts'
        }

request2_wiki = requests.get(endpoint, params=params)

request2_wiki.status_code

200
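
Passing `params` produces the same request as the hand-written URL in the first cell. The sketch below illustrates the idea with the standard library's `urlencode`; `requests` performs its own encoding internally, so this is an approximation of what happens rather than its actual code path.

```python
from urllib.parse import urlencode

endpoint = 'https://en.wikipedia.org/w/api.php'
params = {
    'action': 'query',
    'format': 'json',
    'titles': 'Xoloitzcuintle',
    'prop': 'extracts',
}

# Join the endpoint and the encoded parameters into a full URL.
full_url = endpoint + '?' + urlencode(params)
print(full_url)
```

In the notebook, `request2_wiki.url` shows the URL that `requests` actually sent, which is a quick way to check how a parameter was encoded.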

In [15]:
#request2_wiki.json()['query']

In [16]:
df = pd.json_normalize(request2_wiki.json()['query'])
#df['pages.243549.extract'][0]

Options for restricting the extract:
- exintro: return only the content before the first section (the summary)
- exchars: maximum number of characters to return
- exsentences: number of sentences to return
- explaintext: return plain text instead of HTML
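
These options can be combined in a small parameter builder, sketched below. The function name `build_extract_params` and its keyword arguments are hypothetical. Two points shape the design: the Action API treats a boolean parameter as true whenever it is present, whatever its value, so false flags are left out entirely; and the sketch treats `exchars` and `exsentences` as alternatives, which is a choice made for this example rather than a documented rule.

```python
from typing import Optional


def build_extract_params(
    title: str,
    intro_only: bool = False,
    plain_text: bool = False,
    sentences: Optional[int] = None,
    chars: Optional[int] = None,
) -> dict:
    """Build Action API parameters for a `prop=extracts` query."""
    if sentences is not None and chars is not None:
        # Assumption of this sketch: pick one length limit, not both.
        raise ValueError('Use either sentences or chars, not both')

    params = {
        'action': 'query',
        'format': 'json',
        'titles': title,
        'prop': 'extracts',
    }
    # Boolean flags are only added when true; sending a "false" value
    # would still switch them on.
    if intro_only:
        params['exintro'] = 1
    if plain_text:
        params['explaintext'] = 1
    if sentences is not None:
        params['exsentences'] = sentences
    if chars is not None:
        params['exchars'] = chars
    return params


params = build_extract_params('Xoloitzcuintle', plain_text=True, sentences=2)
print(params)
```

The resulting dictionary can be passed directly as `params=` to `requests.get`.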

In [17]:
buscar_titulo = 'Xoloitzcuintle'
 
#endpoint = 'https://es.wikipedia.org/w/api.php'
endpoint = 'https://en.wikipedia.org/w/api.php'


params = {
            'action' : 'query',
            'format' : 'json',
            'titles' : buscar_titulo, 
            'prop' : 'extracts',
            'exintro': True,
            'explaintext': True
        }

request3_wiki = requests.get(endpoint, params=params)

request3_wiki.status_code

200

In [18]:
df = pd.json_normalize(request3_wiki.json()['query'])
resumen_xolo = df['pages.243549.extract'][0]

In [19]:
resumen_xolo

"The Xoloitzcuintle (or Xoloitzquintle, Xoloitzcuintli, or Xolo) is one of several breeds of hairless dog. It is found in standard, intermediate, and miniature sizes. The Xolo also comes in a coated variety, totally covered in fur. Coated and hairless can be born in the same litter as a result of the same combination of genes. The hairless variant is known as the Perro pelón mexicano or Mexican hairless dog. It is characterized by its duality, wrinkles, and dental abnormalities, along with a primitive temper. In Nahuatl, from which its name originates, it is xōlōitzcuintli [ʃoːloːit͡sˈkʷint͡ɬi] (singular) and xōlōitzcuintin [ʃoːloːit͡sˈkʷintin] (plural). The name comes from the god Xolotl that, according to ancient narratives, is its creator and itzcuīntli [it͡sˈkʷiːnt͡ɬi], meaning 'dog' in the Nahuatl language.\n\n"

---
### Search by keyword (full text)

In [20]:
buscar_en_titulo = 'Leon'
 
endpoint = 'https://es.wikipedia.org/w/api.php'

params = {
            'action' : 'query',
            'format' : 'json',
            'list':'search',
            'srsearch' : buscar_en_titulo
        }

request4_wiki = requests.get(endpoint, params=params)

request4_wiki.status_code

200
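
A single `list=search` request returns a limited number of hits (10 by default, adjustable with `srlimit`). When more are available, the response includes a `continue` object whose values are sent back as extra parameters on the next request. The generator below is a minimal sketch of that loop; `iter_search_results` and the `fetch` callable are hypothetical names, and `fetch` is injected so the logic can be tried without network access.

```python
from typing import Callable, Iterator


def iter_search_results(
    fetch: Callable[[dict], dict],
    query: str,
    max_results: int = 30,
) -> Iterator[dict]:
    """Yield search hits, following the API's `continue` block between requests."""
    params = {
        'action': 'query',
        'format': 'json',
        'list': 'search',
        'srsearch': query,
    }
    yielded = 0
    while True:
        data = fetch(params)
        for hit in data.get('query', {}).get('search', []):
            yield hit
            yielded += 1
            if yielded >= max_results:
                return
        if 'continue' not in data:
            return  # no more result pages
        # Merge the continuation values (e.g. sroffset) into the next request.
        params = {**params, **data['continue']}
```

With the notebook's variables, `fetch` could be `lambda p: requests.get(endpoint, params=p).json()`, and `pd.json_normalize(list(iter_search_results(fetch, buscar_en_titulo)))` would then build one DataFrame from several result pages.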

In [21]:
#request4_wiki.json()['query']['search']
df = pd.json_normalize(request4_wiki.json()['query']['search'])
df

|  | ns | title | pageid | size | wordcount | snippet | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 0 | LEON | 766618 | 1824 | 161 | `<span class="searchmatch">LEON</span> es un nú...` | 2022-12-13T14:18:14Z |
| 1 | 0 | Panthera leo | 25116 | 146793 | 17903 | `El <span class="searchmatch">león</span> (Pant...` | 2023-04-25T08:50:13Z |
| 2 | 0 | Léon | 232172 | 31324 | 3716 | `<span class="searchmatch">Léon</span>, también...` | 2023-03-04T18:05:25Z |
| 3 | 0 | León (España) | 3341 | 233782 | 25540 | `<span class="searchmatch">León</span> (en <spa...` | 2023-04-26T19:37:44Z |
| 4 | 0 | Castilla y León | 485 | 194246 | 20428 | `Castilla y <span class="searchmatch">León</spa...` | 2023-04-29T10:02:47Z |
| 5 | 0 | León (heráldica) | 1743552 | 18355 | 949 | `En Heráldica el <span class="searchmatch">León...` | 2023-05-01T17:22:17Z |
| 6 | 0 | Nuevo León | 25593 | 119632 | 9564 | `Nuevo <span class="searchmatch">León</span> ( ...` | 2023-04-25T03:30:11Z |
| 7 | 0 | Provincia de León | 23530 | 161746 | 16488 | `<span class="searchmatch">León</span> (en <spa...` | 2023-04-08T08:59:03Z |
| 8 | 0 | Reino de León | 41339 | 32095 | 3709 | `El reino de <span class="searchmatch">León</sp...` | 2023-04-27T03:14:13Z |
| 9 | 0 | León de Los Aldama | 66887 | 162208 | 20499 | `<span class="searchmatch">León</span> de los A...` | 2023-04-26T14:12:31Z |
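
The `snippet` column carries HTML markup around each match. If plain text is preferred, the tags can be stripped with a small helper such as the one sketched here; `clean_snippet` is a hypothetical name, and the regular expression is a simple approach that is adequate for these short snippets but is not a general HTML parser.

```python
import html
import re

# Matches any HTML tag, e.g. <span class="searchmatch"> or </span>.
TAG_PATTERN = re.compile(r'<[^>]+>')


def clean_snippet(snippet: str) -> str:
    """Remove HTML tags and decode entities in a search snippet."""
    without_tags = TAG_PATTERN.sub('', snippet)
    return html.unescape(without_tags).strip()


print(clean_snippet('El <span class="searchmatch">león</span> (Pant...'))
```

Applied to the DataFrame above, `df['snippet'].map(clean_snippet)` would return the cleaned column.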


---

### [REST API](https://www.mediawiki.org/wiki/API:REST_API/Reference#Python_3)

In [22]:
buscar_en_titulo = 'Leon'
 
endpoint = 'https://es.wikipedia.org/w/rest.php/v1/search/title'

params = {
            'q' : buscar_en_titulo,
            'limit': 10
        }

request5_wiki = requests.get(endpoint, params=params)

request5_wiki.status_code

200

In [23]:
df = pd.json_normalize(request5_wiki.json()['pages'])
df

|  | id | key | title | excerpt | matched_title | description | thumbnail.mimetype | thumbnail.size | thumbnail.width | thumbnail.height | thumbnail.duration | thumbnail.url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25116 | Panthera_leo | Panthera leo | Leon | Leon | mamífero carnívoro de la familia de los félidos | image/jpeg |  | 60 | 45 |  | //upload.wikimedia.org/wikipedia/commons/thumb... |
| 1 | 7236 | Leonardo_da_Vinci | Leonardo da Vinci | Leonardo da Vinci |  | polímata italiano | image/jpeg |  | 60 | 94 |  | //upload.wikimedia.org/wikipedia/commons/thumb... |
| 2 | 7057 | Leonhard_Euler | Leonhard Euler | Leonhard Euler |  | matemático nacido en Suiza | image/jpeg |  | 60 | 75 |  | //upload.wikimedia.org/wikipedia/commons/thumb... |
| 3 | 218655 | Leonardo_DiCaprio | Leonardo DiCaprio | Leonardo DiCaprio |  | actor y productor cinematográfico estadounidense | image/jpeg |  | 60 | 95 |  | //upload.wikimedia.org/wikipedia/commons/thumb... |
| 4 | 74170 | Leonid_Brézhnev | Leonid Brézhnev | Leonid Brézhnev |  | político soviético | image/jpeg |  | 60 | 84 |  | //upload.wikimedia.org/wikipedia/commons/thumb... |
| 5 | 66278 | Leonel_Fernández | Leonel Fernández | Leonel Fernández |  | político, escritor y abogado dominicano | image/jpeg |  | 60 | 76 |  | //upload.wikimedia.org/wikipedia/commons/thumb... |
| 6 | 220964 | Leonor_de_Borbón | Leonor de Borbón | Leonor de Borbón |  | XXXVII Princesa de Asturias (2014-presente). | image/jpeg |  | 60 | 80 |  | //upload.wikimedia.org/wikipedia/commons/thumb... |
| 7 | 468925 | Leonardo_Favio | Leonardo Favio | Leonardo Favio |  | artista argentino | image/jpeg |  | 60 | 82 |  | //upload.wikimedia.org/wikipedia/commons/thumb... |
| 8 | 527170 | Leonel_Álvarez | Leonel Álvarez | Leonel Álvarez |  | futbolista colombiano | image/jpeg |  | 60 | 97 |  | //upload.wikimedia.org/wikipedia/commons/thumb... |
| 9 | 82604 | Leonard_Cohen | Leonard Cohen | Leonard Cohen |  | Poeta, escritor y cantautor canadiense | image/jpeg |  | 60 | 89 |  | //upload.wikimedia.org/wikipedia/commons/thumb... |
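
The `thumbnail.url` values start with `//`, meaning they are protocol-relative and need a scheme before they can be fetched. One way to complete them is sketched below with `urljoin`; the helper name `absolute_thumbnail_url` and the example path are hypothetical, and the type check is there because a page without an image would have no thumbnail URL (a missing value in the DataFrame).

```python
from typing import Optional
from urllib.parse import urljoin


def absolute_thumbnail_url(
    thumbnail_url: object,
    base: str = 'https://es.wikipedia.org',
) -> Optional[str]:
    """Turn a protocol-relative thumbnail URL into an absolute one."""
    # Missing thumbnails show up as None or NaN, so only strings are handled.
    if not isinstance(thumbnail_url, str) or not thumbnail_url:
        return None
    # urljoin borrows the scheme from `base` for URLs that start with '//'.
    return urljoin(base, thumbnail_url)


# Hypothetical path, used only to illustrate the conversion.
print(absolute_thumbnail_url('//upload.wikimedia.org/example.jpg'))
```

On the DataFrame above, `df['thumbnail.url'].map(absolute_thumbnail_url)` would produce the absolute URLs.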


In [24]:
buscar_titulo = 'Xoloitzcuintle'
 
endpoint = 'https://en.wikipedia.org/w/rest.php/v1/page/' + buscar_titulo

request6_wiki = requests.get(endpoint)

request6_wiki.status_code

200
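
Concatenating the title onto the endpoint works for `Xoloitzcuintle`, but titles containing spaces, accents, parentheses, or slashes should be percent-encoded before they are placed in the URL path. The sketch below shows one way to do that; `page_url` and `REST_BASE` are hypothetical names, and replacing spaces with underscores is an assumption that mirrors the `key` column of the search results above.

```python
from urllib.parse import quote

REST_BASE = 'https://{lang}.wikipedia.org/w/rest.php/v1'


def page_url(title: str, lang: str = 'en', variant: str = '') -> str:
    """Build a REST API page URL with a safely encoded title."""
    # safe='' also encodes '/', so a title like 'AC/DC' stays one path segment.
    key = quote(title.replace(' ', '_'), safe='')
    url = f'{REST_BASE.format(lang=lang)}/page/{key}'
    # Optional suffix such as 'html' to request rendered HTML instead of source.
    return f'{url}/{variant}' if variant else url


print(page_url('Xoloitzcuintle'))
print(page_url('León (España)', lang='es'))
```

In the notebook, `requests.get(page_url(buscar_titulo))` would request the same page as the cell above.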

In [25]:
request6_wiki.json().keys()

dict_keys(['id', 'key', 'title', 'latest', 'content_model', 'license', 'source'])

In [26]:
df = pd.json_normalize(request6_wiki.json())
#df['source'][0]